
Interpretable Scientific Discovery with Symbolic Regression: A Review


Nour Makke, Sanjay Chawla

Qatar Computing Research Institute, HBKU, Doha

May 3, 2023

Abstract


Symbolic regression is emerging as a promising machine learning method for learning succinct, interpretable mathematical expressions directly from data. Whereas it has traditionally been tackled with genetic programming, it has recently gained growing interest in deep learning as a data-driven model discovery method, achieving significant advances in application domains ranging from the fundamental to the applied sciences. In this survey, we present a structured and comprehensive overview of symbolic regression methods and discuss their strengths and limitations.


Contents

1 Introduction
2 Problem Definition
2.1 Class of Function
2.2 Expression representation
3 Symbolic regression methods overview
4 Linear symbolic regression
4.1 Unidimensional case
4.1.1 Univariate function
4.1.2 Multivariate function
4.2 Multidimensional case
5 Nonlinear symbolic regression
6 Tree expression
6.1 Genetic Programming
6.2 Transformers
6.3 Reinforcement learning
7 Applications
8 Datasets
9 Discussion
10 Conclusion
A Datasets Benchmarks Equations


1 Introduction


Symbolic Regression (SR) is a rapidly growing subfield within machine learning (ML) that aims to infer symbolic mathematical expressions from data [1, 2]. Interest in SR is driven by the observation that it is not sufficient to have accurate predictive models; it is often also necessary that the learned models be interpretable [3]. A model is interpretable if the relationship between its input and output can be logically or mathematically traced in a succinct manner. In other words, learned models are interpretable if they can be expressed as mathematical equations. As disciplines become increasingly data-rich and adopt ML techniques, the demand for interpretable models is likely to grow. For example, in the natural sciences (e.g., physics), mathematical models derived from first principles make it possible to reason about the underlying phenomenon in a way that is not possible with predictive models such as deep neural networks. In critical disciplines like healthcare, non-interpretable models may never be allowed to be deployed, however accurate they may be [4].

Example: Consider a data set consisting of samples $(q_1, q_2, r, F)$, where $q_1$ and $q_2$ are the charges of two particles, $r$ is the distance between them, and $F$ is the measured force between the particles. Assume $q_1$, $q_2$, and $r$ are the input variables, and $F$ is the output variable. Suppose we model the input-output relationship as $F = \theta_0 + \theta_1 q_1 + \theta_2 q_2 + \theta_3 r$. Then, using the data set, we can infer the model's parameters $(\theta_i)$. The model will be interpretable because we will know the impact of each variable on the output. For example, if $\theta_3$ is negative, that implies that as $r$ increases, the force $F$ will decrease. From physics, we know that this form of the model is unlikely to be accurate. On the other hand, we could model the relationship using a neural network, $F = \mathrm{NN}(q_1, q_2, r, \theta)$. We expect the model to be highly accurate and predictive because neural networks (NNs) are universal function approximators. However, the model is uninterpretable because the input-output relationship is not easily apparent. The input feature vector undergoes several layers of nonlinear transformations, i.e., $y = \sigma(\sum_i W_i\, \sigma(\sum_j W_j\, \sigma(\sum_k W_k\, \sigma(W x))))$, where $\sigma$ is a nonlinear activation function and $W_{idx}$ are the learnable parameters of the NN layer of index $idx$. Such models, called "blackbox" models, do not have an internal logic that lets users understand how inputs are mathematically mapped to outputs. Explainability is the application of other methods to explain model predictions and to understand how the model is learned; it refers to why the model makes a decision the way it does. What distinguishes explainability from interpretability is that interpretable models are transparent [3]. For example, the predictions of a linear regression model can be interpreted by evaluating the relative contribution of individual features to the predictions using their weights. An ideal SR model will return the relationship as $k\, q_1 q_2 / r^2$, which is the definition of the Coulomb force between two charged particles with a constant¹ $k = 8.98 \times 10^9$. However, learning the SR model is highly non-trivial, as it involves searching over a large space of mathematical operations and identifying the right constant ($k$) that will fit the data. SR models can be directly inferred from data or can be used to "whitebox" a "blackbox" model such as a neural network.

The ultimate goal of SR is to bridge data and observations following the Keplerian trial-and-error approach [5]. Kepler developed a data-driven model for planetary motion using the most accurate astronomical measurements of the era, which resulted in elliptic orbits described by a power law. In contrast, Newton developed a dynamic relationship between physical variables that described the underlying process at the origin of these elliptic orbits. Newton's approach [6] led to three laws of motion later verified by experimental observations. Whereas both methods fit the data well, Newton's approach could be generalized to predict behavior in regimes where no data were available. Although SR is regarded as a data-driven model discovery tool, it aims to find a symbolic model that simultaneously fits the data well and can be generalized to regimes not covered by the data.

SR is deployed as an interpretable and predictive ML model or as a data-driven scientific discovery method. SR was investigated as early as 1970 in research works [7, 8, 9] aiming to rediscover empirical laws. Such works iteratively apply a set of data-driven heuristics to formulate mathematical expressions. The first AI system meant to automate scientific discovery is called BACON [10, 11]. It was developed by Patrick Langley in the late 1970s and was successful in rediscovering versions of various physical laws, such as Coulomb's law and Galileo's laws for the pendulum and constant acceleration, among many others. SR was later studied by Koza [12, 13, 1], who proposed that genetic programming (GP) can be used to discover symbolic models by encoding mathematical expressions as computational trees, where GP is an evolutionary algorithm that iteratively evolves an initial population of individuals via biology-inspired operations. SR was since then tackled with GP-based methods [1, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]. Moreover, it was popularized as a data-driven scientific discovery tool with the commercial software Eureqa [26], based on a research work [2]. Whereas GP-based methods achieve high prediction accuracy, they do not scale to high-dimensional data sets and are sensitive to hyperparameters [19]. More recently, SR has been addressed with deep learning-based methods [27, 28, 19, 29, 30, 31, 32, 33], which leverage neural networks (NNs) to learn accurate symbolic models. SR has been applied in the fundamental and applied sciences, such as astrophysics [34], chemistry [35, 36], materials science [37, 38], semantic similarity measurement [39], climatology [40], and medicine [41], among many others. Many of these applications are promising, showing the potential of SR. A recent SR benchmarking platform, SRBench, was introduced by La Cava et al. [42]. It comprises 14 SR methods (among which ten are GP-based), applied to 252 data sets. The goal of SRBench is to provide a benchmark for rigorous evaluation and comparison of SR methods.




¹ $k$ is the electric force (or Coulomb) constant, $k = 8.9875517923 \times 10^9\ \mathrm{kg\,m^3\,s^{-4}\,A^{-2}}$ in SI base units.





This survey aims to help researchers effectively and comprehensively understand the SR problem and how it could be solved, as well as to present the current status of the advances made in this growing subfield. The survey is structured as follows. First, we define the SR problem, present a structured and comprehensive review of methods, and discuss their strengths and limitations. Furthermore, we discuss the adoption of these SR methods across various application domains and assess their effectiveness. Along with this survey, a living review [25] aims to group state-of-the-art SR methods and applications and track advances made in the SR field. The objective is to update this list often to incorporate new research works.

This paper is organized as follows. The SR problem definition is presented in Section 2. We present an overview of methods deployed to solve the SR problem in Section 3, and the methods are discussed in detail in Sections 4, 5, and 6. Selected applications are described and discussed in Section 7. Section 8 presents an overview of existing benchmark data sets. Finally, we summarize our conclusions and discuss perspectives in Section 10.

2 Problem Definition


The problem of symbolic regression can be defined in terms of classical Empirical Risk Minimization (ERM) [43].

Data: Given a data set $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is the input vector and $y_i \in \mathbb{R}$ is a scalar output.

Function Class: Let $\mathcal{F}$ be a function class consisting of mappings $f: \mathbb{R}^d \to \mathbb{R}$.

Loss Function: Define the loss function for every candidate $f \in \mathcal{F}$:

$$l(f) := \sum_{i=1}^{n} l(f(x_i), y_i) \tag{1}$$

A common choice is the squared difference between the output and the prediction, i.e., $l(f) = \sum_i (y_i - f(x_i))^2$.

Optimization: The optimization task is to find the function $f^*$ over the set of functions $\mathcal{F}$ that minimizes the loss function:

$$f^* = \operatorname*{argmin}_{f \in \mathcal{F}} l(f) \tag{2}$$

As discussed below, what distinguishes SR from conventional regression problems is the discrete nature of the function class $\mathcal{F}$. Different methods for solving the SR problem reduce to different characterizations of the function class.

2.1 Class of Function


In SR, to define $\mathcal{F}$, we specify a library of elementary arithmetic operations, mathematical functions, and variables; $\mathcal{F}$ is then the set of all functions that can be obtained by function composition within the library [44]. For example, consider the library:

$$L = \{\mathrm{id}(\cdot),\ \mathrm{add}(\cdot,\cdot),\ \mathrm{sub}(\cdot,\cdot),\ \mathrm{mul}(\cdot,\cdot),\ +1,\ -1\} \tag{3}$$

Then the set of all polynomials (in one variable $x$) with integer coefficients can be derived from $L$ using function composition.
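As a concrete illustration, function composition over this library can be mimicked in a few lines of Python. This is our own sketch (the paper provides no code), with hypothetical helper names:

```python
# Minimal sketch of function composition over L = {id, add, sub, mul, +1, -1}
# of Eq. 3. Each library element is a function of x; composition yields
# integer-coefficient polynomials.

def id_(x):     return x
def neg_one(x): return -1            # the constant -1
def add(f, g):  return lambda x: f(x) + g(x)
def sub(f, g):  return lambda x: f(x) - g(x)
def mul(f, g):  return lambda x: f(x) * g(x)

# Compose f(x) = x^2 + x - 1 purely from library elements:
f = add(mul(id_, id_), add(id_, neg_one))
print([f(x) for x in range(4)])      # [-1, 1, 5, 11]
```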

2.2 Expression representation


It is convenient to express symbolic expressions in a sequential form using either a unary-binary expression tree or polish notation [45]. For example, the expression $f(x) = x_1 x_2^2 - x_3$ can be derived using function composition from $L$ (Eq. 3) and represented as the tree-like structure illustrated in Figure 1a. By traversing the (binary) tree top to bottom and left to right in a depth-first manner, we can represent the same expression as a unique sequence called the polish form, as illustrated in Figure 1b.





Figure 1: (a) Example of a unary-binary tree that encodes $f(x) = x_1 x_2^2 - x_3$. (b) Sequence representation of the tree-like structure of $f(x)$.


In practice, the library $L$ includes many other common elementary mathematical functions, including the basic trigonometric functions such as sine and cosine, the logarithm, the exponential, the square root, the power law, etc. Prior domain knowledge is advantageous for library definition because it reduces the search space to include only the mathematical operations most relevant to the studied problem. Furthermore, a large range of possible numeric constants should be expressible. For example, numbers in base-10 floating-point notation rounded to four significant digits can be represented as a triple of (sign, mantissa, exponent) [31]. The function $\sin(3.456\, x)$, for example, can be represented as $[\sin,\ \mathrm{mul},\ 3456,\ \mathrm{E{-}3},\ x]$.
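Both the tree-to-sequence conversion and the constant encoding can be sketched in a few lines. The following is our own illustrative Python, with hypothetical helper names:

```python
import math

def to_polish(node):
    """Depth-first, left-to-right serialization of a tree into polish form.
    A node is either a token string or a tuple (token, child1, child2, ...)."""
    if isinstance(node, str):
        return [node]
    tokens = [node[0]]
    for child in node[1:]:
        tokens += to_polish(child)
    return tokens

# f(x) = x1 * x2^2 - x3 as a unary-binary tree:
expr = ("sub", ("mul", "x1", ("mul", "x2", "x2")), "x3")
print(to_polish(expr))   # ['sub', 'mul', 'x1', 'mul', 'x2', 'x2', 'x3']

def encode_constant(c, digits=4):
    """Encode a non-zero constant as (sign, mantissa, exponent), rounded to
    `digits` significant digits, following the scheme of [31]."""
    sign = "+" if c >= 0 else "-"
    c = abs(c)
    exponent = math.floor(math.log10(c)) - (digits - 1)
    mantissa = round(c / 10 ** exponent)
    return (sign, mantissa, exponent)

print(encode_constant(3.456))   # ('+', 3456, -3), i.e. tokens [3456, E-3]
```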

3 Symbolic regression methods overview


In this survey, we categorize SR methods in the following manner: regression-based methods, expression tree-based methods, and physics-inspired and mathematics-inspired methods, as presented in Figure 2. For each category, a summary of the mathematical tool, the expression form, the set of unknowns, and the search space is presented in Table 1.

The linear method defines the functional form as a linear combination of nonlinear functions of $x$ drawn from the predefined library $L$. Linear models are expressed as:

$$f(x, \theta) = \sum_j \theta_j h_j(x) \tag{4}$$

where $j$ spans the basis functions of $L$. The optimization problem reduces to finding the set of parameters $\{\theta\}$ that minimizes the loss function defined over a continuous parameter space $\Theta = \mathbb{R}^M$ as follows:

$$\theta^* = \operatorname*{argmin}_{\theta \in \Theta} \sum_i l(f(x_i, \theta), y_i) \tag{5}$$

This method has the advantage of being deterministic, and the disadvantage of imposing a single model structure that remains fixed during training, when only the model's parameters are learned.

The nonlinear method defines the model structure by a neural network. Nonlinear models can thus be expressed as:

$$f(x, W) = \sigma\!\left(\sum_i W_i\, \sigma\!\left(\sum_j W_j\, \sigma(W x)\right)\right) \tag{6}$$

where $\sigma$ is a nonlinear activation function, and $W_{idx}$ are the learnable parameters of the NN layer of index $idx$. Similarly to the linear method, the optimization problem reduces to finding the set of parameters $\{W, b\}$ of the neural network layers that minimizes the loss function over the space of real values.

Expression tree-based methods treat mathematical expressions as unary-binary trees whose internal nodes are operators and whose terminals are operands (variables or constants). This category comprises GP-based, deep neural transformer, and reinforcement learning-based methods. In GP-based methods, a set of transition rules (e.g., mutation, crossover) is defined over the tree space and applied to an initial population of trees throughout many iterations until the loss function is minimized. Transformers [46] represent a novel neural network architecture (encoder and decoder) that uses the attention mechanism, which was primarily introduced to capture long-range dependencies in a sentence. Transformers were designed to operate on sequential data and to perform sequence-to-sequence (seq2seq) tasks. For their use in SR, input data points $(x, y)$ and symbolic expressions $(f)$ are encoded as sequences, and transformers perform set-to-sequence tasks. The unknowns are the weight parameters of the encoder and the decoder. Reinforcement learning (RL) is a machine learning method that seeks to learn a policy $\pi(x|\theta)$ by training an agent to perform a task by interacting with its environment in discrete time steps. An RL setting requires four components: a state space, an action space, state transition probabilities, and a reward. The agent selects an action that is sent to the environment. A reward and a new state are sent back to the agent from its environment and used by the agent to improve its policy at the next time step. In the context of SR, a symbolic expression (sequence) represents a state, predicting an element in a sequence represents an action, the parent and sibling represent the environment, and the reward is commonly chosen as the mean square error (MSE). RL-based SR methods are commonly hybrid and use various ML tools (e.g., NN, RNN, etc.) jointly with RL.





Figure 2: Taxonomy based on the type of symbolic regression method. $\phi$ denotes a neural network function, and $W$ denotes the set of learnable parameters in the NN. $x$ denotes the input data, $z$ denotes a reduced representation of $x$, and $x'$ denotes a new representation of $x$, e.g., obtained by defining new features based on the original ones. $T$ represents the final population of selected expression trees in genetic programming.


4 Linear symbolic regression


The linear approach assumes, by definition, that the target symbolic expression $f(x)$ is a linear combination of nonlinear functions of the feature attributes:

$$f(x) = \sum_j \theta_j h_j(x) \tag{7}$$

Here $x$ denotes the input feature vector, $\theta_j$ denotes a weight coefficient, and $h_j(\cdot)$ denotes a unary operator of the library $L$. This approach predefines the model's structure and reduces the SR problem to learning only the model's parameters by solving a system of linear equations. The particular case where $f(x)$ is a linear combination of monomials reduces to a conventional linear regression problem, i.e., $f(x) = \sum_j \theta_j x^j = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots$. There exist two cases for this problem: (1) a unidimensional case defined by $f: \mathbb{R}^d \to \mathbb{R}$; and (2) a multidimensional case defined by $f: \mathbb{R}^d \to \mathbb{R}^m$, with $d$ the number of input features and $m$ the number of variables required for a complete description of a system; for example, the Lorenz system for fluid flow is defined in terms of three physical variables which depend on time.



Table 1: Table summarizing symbolic regression methods. The mathematical tool, the expression form, the set of unknowns, and the search space are specified for each method. Set2seq abbreviates "set-to-sequence".

| Method | Tool | Expression form | Unknown | Search space |
|---|---|---|---|---|
| Linear SR | Uni-D linear system | $y = \sum_i \theta_i f_i(x)$ | $\{\theta\}_i$ | $\mathbb{R}$ |
| | Multi-D linear system | $y_i = \sum_j \theta_j f_j(x)$ | $(\{\theta\}_i)_j$ | $\mathbb{R}$ |
| Nonlinear SR | Neural network | $y = f(Wx + b)$ | $\{W, b\}$ | $\mathbb{R}$ |
| Expression-tree search | Genetic programming | Expression tree | trees | trees |
| | Transformers | set2seq mapping | $\{W_q, W_k, W_v\}$ | $\mathbb{R}$ |
| | Reinforcement learning | set2seq mapping | $\pi(\theta)$ | $\mathbb{R}$ |
| Physics-inspired | AI-Feynman | $y = f(x, \theta)$ | - | - |
| Mathematics-inspired | Symbolic metamodels | $G(x, \theta)$ | $\theta$ | $\mathbb{R}$ |


4.1 Unidimensional case


Given a data set $D = \{(x_i, y_i)\}_{i=1}^{n}$, the mathematical expression could be either univariate $(x_i \in \mathbb{R},\ y_i = f(x_i))$ or multivariate $(x_i \in \mathbb{R}^d,\ y_i = f(x_i))$. The methodology of linear SR is presented in detail for the univariate case in Section 4.1.1 for simplicity and is extended to the multivariate case in Section 4.1.2.

4.1.1 Univariate function


Data set: $D = \{x_i \in \mathbb{R};\ y_i = f(x_i)\}$.
Library: $L$ can include any number of mathematical operators such that the dimension of the data set is always greater than the dimension of the library matrix (see discussion below).

In this approach, a coefficient $\theta_j$ is assigned to each candidate function $(f_j(\cdot) \in L)$ as an activeness criterion such that:

$$y = \sum_j \theta_j f_j(x) \tag{8}$$

Applying Eq. 8 to the input-output pairs $(x_i, y_i)$ yields a system of linear equations as follows:

$$y_1 = \theta_0 + \theta_1 f_1(x_1) + \theta_2 f_2(x_1) + \cdots + \theta_k f_k(x_1)$$
$$y_2 = \theta_0 + \theta_1 f_1(x_2) + \theta_2 f_2(x_2) + \cdots + \theta_k f_k(x_2) \tag{9}$$
$$\vdots$$
$$y_n = \theta_0 + \theta_1 f_1(x_n) + \theta_2 f_2(x_n) + \cdots + \theta_k f_k(x_n)$$

which can be represented in a matrix form as:

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & f_1(x_1) & f_2(x_1) & \cdots & f_k(x_1) \\ 1 & f_1(x_2) & f_2(x_2) & \cdots & f_k(x_2) \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & f_1(x_n) & f_2(x_n) & \cdots & f_k(x_n) \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_k \end{bmatrix} \tag{10}$$

Equation 10 can then be presented in a compact form:

$$Y = U(X)\,\Theta \tag{11}$$


where $\Theta \in \mathbb{R}^{(k+1)}$ is the sparse vector of coefficients, and $U \in \mathbb{R}^{n \times (k+1)}$ is the library matrix, which can be represented as a function of the input vector $X$ as follows:

$$U(X) = \begin{bmatrix} 1 & f_1(X) & f_2(X) & \cdots & f_k(X) \end{bmatrix} \tag{12}$$

Example: For a library defined as:

$$L = \{1,\ x,\ (\cdot)^2,\ \sin(\cdot),\ \cos(\cdot),\ \exp(\cdot)\} \tag{13}$$

The matrix $U$ becomes:

$$U(X) = \begin{bmatrix} 1 & X & X^2 & \sin(X) & \cos(X) & \exp(X) \end{bmatrix}$$

Each row (of index $i$) in Eq. 12 is a vector of $(k+1)$ functions of $x_i$. The vector of coefficients, i.e., the model's parameters, is obtained by solving Eq. 11 as follows²:

$$\Theta = (U^T U)^{-1} U^T Y \tag{14}$$

The magnitude of a coefficient $\theta_k$ effectively measures the size of the contribution of the associated function $f_k(\cdot)$ to the final prediction. Finally, the prediction vector $\hat{Y}$ can be evaluated using Eq. 11.

An exemplary schematic is illustrated in Figure 3 for the univariate function $f(x) = 1 + \alpha x^3$. Only the coefficients associated with the functions $\{1, x^3\}$ of the library are non-zero, with values equal to $1$ and $\alpha$, respectively.




Figure 3: Schematic of the system of linear equations of Eq. 11 for $f(x) = 1 + \alpha x^3$. A library matrix $U(X)$ of nonlinear functions of the input is constructed, where $L = \{1, x, x^2, x^3, \ldots\}$. The marked entries in the $\Theta$ vector denote the non-zero coefficients determining which functions of the library are active.
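This construction translates directly into a short least-squares computation. The following sketch (our own illustration, under the setup of Figure 3) builds $U(X)$ for $L = \{1, x, x^2, x^3\}$ and solves Eq. 14 via the pseudo-inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 2.5
x = rng.uniform(-1, 1, 20)     # twenty points sampled from U(-1, 1)
y = 1 + alpha * x**3           # ground truth f(x) = 1 + alpha * x^3

# Library matrix U(X) of Eq. 12 for L = {1, x, x^2, x^3}
U = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Eq. 14, computed via the pseudo-inverse U+ (see footnote 2)
theta = np.linalg.pinv(U) @ y
print(np.round(theta, 6))      # -> [1.  0.  0.  2.5]
```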


In the following, linear SR is tested on synthetic data. In each experiment, training and test data sets are generated. Each set consists of twenty data points randomly sampled from a uniform distribution $U(-1, 1)$, and $y$ is evaluated using a univariate function, i.e., $D = \{(x_i, f(x_i))\}_{i=1}^{n}$. Two libraries are considered in these experiments: $L_1 = \{x, (\cdot)^2, (\cdot)^3, \ldots, (\cdot)^9\}$ and $L_2 = L_1 \cup \{\sin(\cdot), \cos(\cdot), \tan(\cdot), \exp(\cdot), \mathrm{sigmoid}(\cdot)\}$. The results are reported in terms of the output expression (Eq. 7) and the coefficient of determination $R^2$. SR problems are grouped into (i) pure polynomial functions and (ii) mixed polynomial and trigonometric functions. In each experiment, the parameters are learned using the training data set, and results are reported for the test data set in Table 2.


For polynomial functions, an exact output is obtained using $L_1$ with $R^2 = 1.0$, whereas only an approximate output is obtained using $L_2$. In the latter case, the quality of the fit depends on the size of the training data set. An exemplary result is shown in Figure 4 for $f(x) = x + x^2 + x^3$. Points represent the (test) data of the input file, i.e., $X$; the red curve represents $f(x)$ as a function of $x$, and the blue and black dashed curves represent the predicted function $\hat{f}(x)$ obtained using $L_1$ and $L_2$, respectively. An exact match between the ground-truth function and the predicted one is found using $L_1$, whereas a significant discrepancy is obtained using $L_2$. This discrepancy could be explained by the fact that various functions in $L_2$ exhibit the same $x$-dependence over the covered $x$-range.




² Technically, the pseudo-inverse $U^{+}$.






Table 2: Results of linear SR in the case of univariate functions. $D = \{(x_i; y_i)\}$; $x_i \sim U(-1, 1, 20)$ and $y_i = f(x_i)$. $L_1 = \{x, (\cdot)^2, \ldots, (\cdot)^9\}$ and $L_2 = L_1 \cup \{\sin(\cdot), \cos(\cdot), \tan(\cdot), \exp(\cdot), \mathrm{sigmoid}(\cdot)\}$. T denotes True, and F denotes False.

| Benchmark | Expression | L1: Exp | L1: R² | L2: Exp | L2: R² |
|---|---|---|---|---|---|
| Nguyen-2 | $x^4 + x^3 + x^2 + x$ | T | 1.0 | F | 0.886 |
| Nguyen-3 | $x^5 + x^4 + x^3 + x^2 + x$ | T | 1.0 | F | 0.867 |
| Livermore-21 | $x^8 + x^7 + x^6 + x^5 + x^4 + x^3 + x^2 + x$ | T | 1.0 | F | 0.869 |
| Livermore-9 | $x^9 + x^8 + x^7 + x^6 + x^5 + x^4 + x^3 + x^2 + x$ | T | 1.0 | F | 0.882 |
| Livermore-6 | $x + 2x^2 + 3x^3 + 4x^4$ | T | 1.0 | F | 0.417 |
| Livermore-19 | $x + x^2 + x^4 + x^5$ | T | 1.0 | F | -0.079 |
| Livermore-14* | $x + x^2 + x^3 + \sin(x)$ | F | 1.0 | F | -0.857 |
| Nguyen-5 | $\sin(x^2)\cos(x) - 1$ | F | 0.999 | F | -3.97 |
| Nguyen-6 | $\sin(x) + \sin(x + x^2)$ | F | 0.999 | F | 0.564 |



Figure 4: Result of linear SR for the Nguyen-1 benchmark, i.e., $f(x) = x + x^2 + x^3$. Red points represent the (test) data set. The red curve represents the true function. The blue and black dashed curves represent the learned functions using $L_1$ and $L_2$, respectively.


For mixed polynomial and trigonometric expressions, neither library choice produces the exact expression. However, a better $R^2$ coefficient is obtained using $L_1$. In the case of the Nguyen-5 benchmark, for example, i.e., $f(x) = \sin(x^2)\cos(x) - 1$, the resulting function is the Taylor expansion of $f$:

$$\hat{y}(x) \approx -1 + 0.9\, x^2 - 0.5\, x^4 - 0.13\, x^6 + O(x^8)$$

In conclusion, this approach cannot learn the ground-truth function when the latter is a product of two functions (i.e., $f(x) = f_1(x) \cdot f_2(x)$) or when it has a multiplicative or an additive factor applied to the variable (e.g., $\sin(\alpha + x)$, $\exp(\lambda x)$, etc.). In the best case, it outputs an approximation of the ground-truth function. Furthermore, this approach fails to predict the correct mathematical expression when the library is extended to include a mixture of polynomial, trigonometric, exponential, and logarithmic functions.

4.1.2 Multivariate function


For a given data set $D = \{x_i \in \mathbb{R}^d;\ y_i = f(x_1, \ldots, x_d)\}$, where $d$ is the number of features, the same equations presented in Section 4.1.1 are applicable. However, the dimension of the library matrix $U$ changes to account for the dimension of the feature vector. For example, for the same library shown in Eq. 13 and a two-dimensional feature vector, i.e., $X \in \mathbb{R}^2$, $U(X)$ becomes:

$$U(X) = \begin{bmatrix} 1 & X & X_{P_2} & \sin(X) & \cos(X) & \exp(X) \end{bmatrix} \tag{15}$$
$$= \begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 & \sin(x_1) & \sin(x_2) & \cdots \end{bmatrix}$$

Here, $X_{P_q}$ denotes the polynomials in $X$ of order $q$.

Table 3 presents the results of the experiments performed on functions of two variables, i.e., $f(x_1, x_2)$. Similarly to Section 4.1.1, training and test data sets are generated by randomly sampling twenty pairs of points $(x_1, x_2)$ from a uniform distribution $U(-1, 1)$ such that $D = \{(x_{1i}, x_{2i}, f(x_{1i}, x_{2i}))\}_{i=1}^{n}$. The same choices for the library are considered: $L_1 = \{x, (\cdot)^2, \ldots, (\cdot)^9\}$ and $L_2 = L_1 \cup \{\sin(\cdot), \cos(\cdot), \tan(\cdot), \exp(\cdot), \mathrm{sigmoid}(\cdot)\}$. An exact match between the ground truth and the predicted function is obtained using $L_1$ for any polynomial function, whereas only approximate solutions are obtained for trigonometric functions. Using $L_2$, the results are approximations of the ground-truth function.

Table 3: Results for multivariate functions using linear SR. $D = \{(x_1, x_2) \sim U(-1, 1, 20);\ y = f(x_1, x_2)\}$. $L_1 = \{x, (\cdot)^2, \ldots, (\cdot)^9\}$ and $L_2 = L_1 \cup \{\sin(\cdot), \cos(\cdot), \tan(\cdot), \exp(\cdot), \mathrm{sigmoid}(\cdot)\}$. T and F refer to True and False.


| Benchmark | Expression | L1: Result | L1: R² | L2: Result | L2: R² |
|---|---|---|---|---|---|
| Nguyen-12 | $x_1^4 - x_1^3 + \frac{1}{2} x_2^2 - x_2$ | T | 1.0 | F | 1 |
| Livermore-5 | $x_1^4 - x_1^3 + x_1^2 - x_2$ | T | 1.0 | F | 1 |
| Nguyen-9 | $\sin(x_1) + \sin(x_2^2)$ | F | 1 | F | 1 |
| Nguyen-10 | $2\sin(x_1)\cos(x_2)$ | F | 1 | F | 1 |


Furthermore, linear SR is tested on a data set generated using a two-dimensional multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$, as shown in Figure 5. Different analytic expressions for $f(x_1, x_2)$ were tested with the different library bases summarized in Table 4, including pure polynomial basis functions, polynomial and trigonometric basis functions, and a mixed library.




Figure 5: Two-dimensional multivariate normal distribution used in test applications.


The function $y_1 = \cos(x_1) + \sin(x_2)$ is explored with all three bases. In the case of a pure polynomial basis, the correct terms of the Taylor expansions of both $\cos(x_1)$ and $\sin(x_2)$ are identified with only approximate values of their coefficients, i.e., $\hat{y}_1 = (0.88 - 0.3\, x_1^2 + 0.01\, x_1^4) + (0.97\, x_2 - 0.2\, x_2^3)$, which is reflected in the significantly high reconstruction error of the order of $30\%$. In both bases in which trigonometric functions are enlisted, the correct terms $\cos(x_1)$ and $\sin(x_2)$ are identified with an excellent reconstruction error of $\sim 10^{-7}$. Note that the lowest reconstruction error is obtained for the library U2, which has the smallest number of operations and, consequently, the lowest number of coefficients.



Table 4: Library bases used in the test applications.

| Name | List of functions |
|---|---|
| U1 | $1,\ X,\ X_{P_2},\ X_{P_3},\ X_{P_4}$ |
| U2 | $1,\ X,\ X_{P_2},\ \cos(X),\ \sin(X)$ |
| U3 | $1,\ X,\ X_{P_2},\ \cos(X),\ \sin(X),\ \tan(X),\ \exp(X),\ \mathrm{sigmoid}(X)$ |


The function $y_2 = x_1^2 + \cos(x_2)$ is also tested. For the pure polynomial basis, the reconstructed function $\hat{y}_2 = x_1^2 + (0.83 + 0.49\, x_2 - x_2^2)$ predicts approximate values with a reconstruction error of $1\%$. An excellent prediction is made for the other bases, which enlist both operations appearing in $y_2(x_1, x_2)$.

In the same exercise, a more complicated functional form is tested that includes mixed terms, i.e., $y_3 = x_1(1 + x_2) + \cos(x_1)\sin(x_2)$. The difference between the true and the predicted function is illustrated in Figure 6. The linear approach performs similarly for all three library bases. A low reconstruction error cannot be obtained because the mixed term $\cos(x_1)\sin(x_2)$ in $y_3$ is not enlisted in any of the libraries, showing an important limitation of the current approach.




Figure 6: Difference between the true ($y$) and predicted ($\hat{y}$) values of the function $y = x_1(1 + x_2) + \cos(x_1)\sin(x_2)$, for the three libraries defined in Table 4: U1 (left), U2 (center), U3 (right).


4.2 Multidimensional case


The target mathematical expression comprises $m$ components, i.e., $Y = [y_1, \ldots, y_m]$, and the goal is to learn the coefficients of a system of linear equations rather than a single mathematical expression. Each component $(y_j)$ is described by:

$$y_j = f_j(x) = \sum_k \theta_{jk} h_k(x) \tag{16}$$

In this case, there exist $m$ sparse vectors of coefficients, i.e., $\Theta = [\theta_1 \cdots \theta_m]$. Consider the Lorenz system, which is a set of ordinary differential equations that captures nonlinearities in the dynamics of fluid convection. It consists of three variables $\{x_1, x_2, x_3\}$ and their first-order derivatives with respect to time $\{dx_1/dt,\ dx_2/dt,\ dx_3/dt\}$, which we will refer to as $\{y_1, y_2, y_3\}$. Using the library of Eq. 13, the system of linear equations is represented in matrix form as follows:

$$\begin{bmatrix} y_1 & y_2 & y_3 \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 & \cdots & \exp(x_2) \end{bmatrix} \begin{bmatrix} \theta_1 & \theta_2 & \theta_3 \end{bmatrix} \tag{17}$$


Here, $Y \in \mathbb{R}^{n \times 3}$, $U(X) \in \mathbb{R}^{n \times k}$, and $\Theta \in \mathbb{R}^{k \times 3}$, where $n$ is the size of the input data and $k$ is the number of columns in the library matrix $U$. The $j$-th component of the $Y$ vector is given by:

$$y_j = \theta_{j,0} + \theta_{j,1} x_1 + \theta_{j,2} x_2 + \theta_{j,3} x_1^2 + \cdots + \theta_{j,k} \exp(x_2) \tag{18}$$

Equation 17 can be written in a compact form as:

$$y_k = U(x^T)\,\theta_k \tag{19}$$

The application presented in [33] uses this approach, where the authors aim to learn the differential equations that govern the dynamics of a given system, such as a nonlinear pendulum and the Lorenz system. The approach successfully learned the exact weights, allowing the correct governing equations to be recovered.

An exemplary schematic is illustrated in Figure 7 for the Lorenz system defined by $\dot{x} = \sigma(y - x)$, $\dot{y} = x(\rho - z) - y$, $\dot{z} = xy - \beta z$. Here $x$, $y$, and $z$ are physical variables, and $\dot{x}$, $\dot{y}$, and $\dot{z}$ are their respective time derivatives. Only the coefficients associated with the functions $\{x_1, x_1 x_2, \ldots\}$ should be non-zero and equal to the factors shown in the Lorenz system's set of equations.




Figure 7: Schematic of the system of Eq. 11 for the Lorenz system defined by $y_1 = \sigma(x_2 - x_1)$, $y_2 = x_1(\rho - x_3) - x_2$, $y_3 = x_1 x_2 - \beta x_3$. A library $U(X)$ of nonlinear functions of the input is constructed. The marked entries in the $\theta$ vectors denote the non-zero coefficients determining which library functions are active for each of the three variables $\{y_1, y_2, y_3\}$.
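A minimal sketch of this multidimensional case, in the spirit of SINDy [51] (our own illustration, not the authors' code), simulates the Lorenz system, builds a degree-2 polynomial library, and recovers the sparse coefficient vectors by sequentially thresholded least squares:

```python
import numpy as np
from scipy.integrate import solve_ivp

sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0
def lorenz(t, s):
    x, y, z = s
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t = np.linspace(0, 10, 5000)
sol = solve_ivp(lorenz, (0, 10), [1.0, 1.0, 1.0], t_eval=t)
X = sol.y.T                                    # states, shape (n, 3)
Y = np.array([lorenz(0, s) for s in X])        # exact time derivatives

# Library U(X): polynomials up to degree 2 (a truncation of Eq. 17)
x1, x2, x3 = X.T
U = np.column_stack([np.ones(len(X)), x1, x2, x3,
                     x1 * x1, x1 * x2, x1 * x3, x2 * x2, x2 * x3, x3 * x3])

# Sequentially thresholded least squares: solve, zero small terms, refit
Theta = np.linalg.lstsq(U, Y, rcond=None)[0]   # shape (10, 3)
for _ in range(10):
    Theta[np.abs(Theta) < 0.1] = 0.0
    for j in range(3):
        idx = np.abs(Theta[:, j]) > 0.0
        Theta[idx, j] = np.linalg.lstsq(U[:, idx], Y[:, j], rcond=None)[0]
print(np.round(Theta, 2))   # non-zero entries reproduce sigma, rho, beta
```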


In summary, the linear approach is only successful in particular cases and cannot be generalized. Its main limitation lies in predefining the model's structure as a linear combination of nonlinear functions, reducing the SR problem to solving a system of linear equations. In contrast, the main mission of SR is to learn both the model's structure and its parameters. A direct consequence of this limitation is that the linear approach fails to learn expressions in many cases: (i) products of functions (e.g., $f(x) = f_1(x) \cdot f_2(x)$); (ii) multivariate functions (e.g., $\exp(xy)$, $\tan(x + y)$, etc.); and (iii) functions including multiplicative or additive factors applied to their arguments (e.g., $\exp(\lambda x)$). Finally, the dimension of the library matrix can be challenging in terms of computing resources for extended libraries and high-dimensional data sets.

5 Nonlinear symbolic regression


The nonlinear method uses deep neural networks (DNN), known for their great ability to detect and learn complex patterns directly from data.

DNNs have the advantage of being fully differentiable in their free parameters, allowing end-to-end training using back-propagation. This approach searches for the target expression by replacing the standard activation functions in a neural network with elementary mathematical operations. Figure 8 shows an NN-based architecture for SR called the Equation Learner (EQL) network, proposed by Martius and Lampert [47], in comparison with a standard NN. Only two hidden layers are shown for simple visualization, but the network's depth is controlled as per the case study.

The EQL network uses a multi-layer feed-forward NN with one output node. A linear transformation $z^{[l]}$ is applied at every hidden layer $(l)$, followed by a nonlinear transformation $a_i^{[l]}$ using unary (i.e., one-argument) and binary (i.e., two-argument) activation functions as follows:

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$
$$a_i^{[l]} = f_i(z_i^{[l]}) \tag{20}$$


where $\{W, b\}$ denote the weight parameters and $f_i$ denotes an individual activation function from the library $L = \{\mathrm{identity},\ (\cdot)^n,\ \cos,\ \sin,\ \exp,\ \log,\ \mathrm{sigmoid}\}$. In a standard NN, the same activation function is applied to all hidden units and is typically chosen among {ReLU, tanh, sigmoid, softmax, etc.}.




Figure 8: Exemplary setup of a standard NN (8a) and an EQL-NN (8b) with input $x$, output $\hat{y}$, and two hidden layers. In (a), $f$ denotes the activation function, usually chosen among {ReLU, tanh, sigmoid}, while in EQL each node has a specific activation function drawn from the function class $F$.


The problem reduces to learning the correct weight parameters $\{W^{[l]}, b^{[l]}\}$, whereas the operators of the target mathematical expression are selected during training. To overcome the interpretability limitation of neural network-based architectures, and to promote simple over complex solutions (as is typical of formulas describing physical processes), sparsity is enforced by adding an $l_1$ regularization term to the $l_2$ loss function such that:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{y}(x_i) - y_i \right\|^2 + \lambda \sum_{l=1}^{L} \big| W^{[l]} \big|_1 \tag{21}$$

where $N$ denotes the number of data entries and $L$ denotes the number of layers. Whereas this method is end-to-end differentiable in the NN parameters and scales well to high-dimensional problems, back-propagation through activation functions such as division or logarithm requires simplifications of the search space, thus limiting its ability to produce simple expressions involving divisions (e.g., $\sin(x/y)/x$). An extended version, EQL$^{\div}$ [48], includes only the division, whereas exponential and logarithm activation functions are not included because of numerical issues.
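To make the EQL idea concrete, the following PyTorch sketch (our own minimal illustration, not the EQL authors' implementation) shows one EQL-style layer with mixed unary activations and a product unit, trained with the $l_1$-regularized loss of Eq. 21 on toy data:

```python
import torch
import torch.nn as nn

class EQLLayer(nn.Module):
    """One EQL hidden layer (Eq. 20): a linear map followed by a bank of
    unary activations {identity, sin, cos} and one binary product unit."""
    def __init__(self, in_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, 5)   # 3 unary inputs + 2 product inputs

    def forward(self, a):
        z = self.lin(a)
        unary = torch.cat([z[:, 0:1], torch.sin(z[:, 1:2]), torch.cos(z[:, 2:3])], dim=1)
        product = (z[:, 3] * z[:, 4]).unsqueeze(1)
        return torch.cat([unary, product], dim=1)  # 4 outputs

model = nn.Sequential(EQLLayer(1), nn.Linear(4, 1))

x = torch.linspace(-3, 3, 200).unsqueeze(1)
y = x * torch.sin(x)                               # toy target y = x sin(x)

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
lam = 1e-3                                         # lambda of Eq. 21
for _ in range(2000):
    opt.zero_grad()
    mse = ((model(x) - y) ** 2).mean()             # l2 data term
    l1 = sum(p.abs().sum() for n, p in model.named_parameters() if "weight" in n)
    (mse + lam * l1).backward()                    # l1 sparsity regularizer
    opt.step()
```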

6 Tree expression


This section discusses SR methods in which a mathematical expression is regarded as a unary-binary tree consisting of internal nodes and terminals. Every internal tree node represents a mathematical operation (e.g., $+$, $-$, $\times$, $\sin$, $\log$, etc.) drawn from a pre-defined function class $\mathcal{F}$ (Section 2.1), and every tree terminal node (or leaf) represents an operand, i.e., a variable or constant, as illustrated for the example shown in Figure 9. Expression tree-based methods include genetic programming, transformers, and reinforcement learning.

6.1 Genetic Programming  6.1 遗传编程


Genetic programming (GP) is an evolutionary algorithm in computer science that searches the space of computer programs to solve a given problem. Starting with a randomly generated "population" (set) of "individuals" (trees), GP evolves the initial population $T_{GP}(0)$ using a set of evolutionary "transition rules" (operations) $\{r_i : f \to f_i,\ i \in \mathbb{N}\}$ defined over the tree space. GP evolutionary operations include mutation, crossover, and selection. The mutation operation introduces random variations into an individual by replacing one subtree with another randomly generated subtree (Figure 10, right). The crossover operation involves exchanging content between two individuals, for example, by swapping one random subtree of one individual with a random subtree of another individual (Figure 10, left). Finally, the selection operation is used to select which individuals from the current population persist into the next population. A common selection operator is tournament selection, in which a set of $k$ candidate individuals is randomly sampled from the population, and the individual with the highest fitness, i.e., the minimum loss, is selected. In a GP algorithm, a single iteration corresponds to one generation. The application of one generation of GP to a population $T_{GP}(i)$ produces a new, augmented population $T_{GP}(i+1)$. In each generation, each individual has a probability of undergoing a mutation operation and a probability of undergoing a crossover operation. Selection is applied so that the dimension of the current population remains the same as that of the previous one. Throughout $M_k$ iterations, the following steps are undertaken: (1) transition rules are applied to the function set $F_k = \{f_1^k, \ldots, f_{M_k}^k\}$ such that $f^{k+1} = r_i(f^k)$, where $k$ denotes the iteration index; (2) the loss function $\ell(F_k)$ is evaluated for the set; and (3) an elite set of individuals is selected for the next iteration step. The GP algorithm repeats this procedure until a predetermined accuracy level is achieved.





Figure 9: (a) Expression-tree structure of $f(x) = x^2 \cos(x)$. (b) $f(x)$ as a function of $x$ (blue curve) and data points (red points) generated using $f(x)$.



Figure 10: Crossover (left) and mutation (right) operations on exemplary expression trees in genetic programming.
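A toy GP loop over expression trees, with tournament selection and subtree mutation as described above (crossover is omitted for brevity), might look as follows; this is our own sketch, not an implementation from the literature:

```python
import random

OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b, "mul": lambda a, b: a * b}

def random_tree(depth=3):
    if depth == 0 or random.random() < 0.3:      # terminal: variable or constant
        return random.choice(["x", round(random.uniform(-2, 2), 2)])
    return [random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1)]

def evaluate(tree, x):
    if tree == "x":
        return x
    if not isinstance(tree, list):
        return tree                              # numeric constant
    return OPS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))

def mutate(tree):
    if random.random() < 0.2:                    # replace this subtree entirely
        return random_tree(2)
    if isinstance(tree, list):
        return [tree[0], mutate(tree[1]), mutate(tree[2])]
    return tree

def mse(tree, data):
    return sum((evaluate(tree, x) - y) ** 2 for x, y in data) / len(data)

data = [(x / 10, (x / 10) ** 2 + x / 10) for x in range(-10, 11)]  # y = x^2 + x
population = [random_tree() for _ in range(200)]
for generation in range(30):
    # tournament selection (k = 3) followed by mutation
    population = [mutate(min(random.sample(population, 3), key=lambda t: mse(t, data)))
                  for _ in range(len(population))]
best = min(population, key=lambda t: mse(t, data))
print(best, mse(best, data))
```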


Whereas GP allows for large variations in the population, resulting in improved performance on out-of-distribution data, GP-based methods do not scale well to high-dimensional data sets and are highly sensitive to hyperparameters [19].

6.2 Transformers


The transformer neural network (TNN) is a novel NN architecture introduced by Vaswani et al. [46] in natural language processing (NLP) to model sequential data. The TNN is based on the attention mechanism, which aims to model long-range dependencies in a sequence. Consider the English-to-French translation of the two following sentences:

En: The kid did not go to school because it was closed.


Fr: L'enfant n'est pas allé à l'école parce qu'elle était fermée.

En: The kid did not go to school because it was cold.

Fr: L'enfant n'est pas allé à l'école parce qu'il faisait froid.

The two sentences are identical except for the last word, which refers to the school in the first sentence (i.e., "closed") and to the weather in the second one (i.e., "cold"). Transformers create a context-dependent word embedding that pays particular attention to the terms (of the sequence) with high weights. In this example, the noun that the adjective of each sentence refers to has a significant weight and is therefore taken into account when translating the word "it". Technically, an embedding $x_i$ is assigned to each element of the input sequence, and a set of $m$ key-value pairs is defined, i.e., $S = \{(k_1, v_1), \ldots, (k_m, v_m)\}$. For each query, the attention mechanism computes a linear combination of values $\sum_j \omega_j v_j$, where the attention weights $(\omega_j \propto q \cdot k_j)$ are derived using the dot product between the query $(q)$ and all keys $(k_j)$, as follows:

$$\mathrm{Attention}(q, S) = \sum_j \sigma(q \cdot k_j)\, v_j \tag{22}$$

Here, $q = x W_q$ is a query, $k_i = x_i W_k$ is a key, $v_i = x_i W_v$ is a value, and $W_q$, $W_k$, $W_v$ are the learnable parameters. The architecture of the self-attention mechanism is illustrated in Figure 11.





Figure 11: Evaluation of $\mathrm{Attention}(q, S)$ (Eq. 22) for a query $q_i$, computed using the input vector embedding $x_i$.
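In code, the attention of Eq. 22 amounts to a few matrix products. Below is a numpy sketch (our own illustration), with softmax as the normalization $\sigma$ and the usual $1/\sqrt{d}$ scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 7, 16                        # e.g. seven tokens of a polish sequence
x = rng.normal(size=(n, d))         # input embeddings x_i
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # learnable parameters

Q, K, V = x @ Wq, x @ Wk, x @ Wv    # queries, keys, values

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

weights = softmax(Q @ K.T / np.sqrt(d))  # attention weights, one row per query
out = weights @ V                        # sum_j w_j v_j for every query
print(out.shape)                         # (7, 16): context-dependent embeddings
```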


In the context of SR, both the input data points $\{(x_i, y_i) \mid x_i \in \mathbb{R}^d,\ y_i \in \mathbb{R},\ i \in \mathbb{N}_n\}$ and the mathematical expression $f$ are encoded as sequences of symbolic representations, as discussed in Section 2.2. The role of the transformer is to create dependencies at two levels: between the numerical and symbolic sequences, and between the tokens of the symbolic sequence. Consider the mathematical expression $f(x, y, z) = \sin(x/y) - \sin(z)$, which can be written as a sequence of tokens in polish notation:


$$[-,\ \sin,\ \div,\ x,\ y,\ \sin,\ z]$$


Each symbol is associated with an embedding such that:

$$x_1 : -\qquad x_2 : \sin\qquad x_3 : \div\qquad x_4 : x\qquad x_5 : y\qquad x_6 : \sin\qquad x_7 : z$$


In this particular example, for the query $(x_7 : z)$, the attention mechanism will give a higher weight to the binary operator $(x_1 : -)$ than to the variable $(x_5 : y)$ or the division operator $(x_3 : \div)$.

Transformers consist of an encoder-decoder structure; each block comprises a self-attention layer and a feed-forward neural network. The TNN inputs a sequence of embeddings $\{x_i\}$ and outputs a "context-dependent" sequence of embeddings $\{y_i\}$, one at a time, through a latent representation $z_i$. The TNN is an auto-regressive model, i.e., the sampling of each symbol is conditioned on the previously sampled symbols and the latent sequence. An example of a TNN encoder is shown in Figure 12. In the symbolic regression case, the encoder and the decoder do not share the same vocabulary, because the decoder has a mixture of symbolic and numeric representations while the encoder has only numeric representations. There exist two approaches to solving SR problems using transformers. The first is the skeleton approach [32, 49], where the transformer conducts a two-step procedure: (1) the decoder predicts a skeleton $f_e$, a parametric function that defines the general shape of the target expression up to a choice of constants, using the function class $\mathcal{F}$; and (2) the constants are fitted using optimization techniques such as the nonlinear optimization solver BFGS. For example, if $f = \cos(2 x_1) - 0.1 \exp(x_2)$, then the decoder predicts $f_e = \cos(\circ\, x_1) - \circ \exp(x_2)$, where $\circ$ denotes an unknown constant. The second is an end-to-end (E2E) approach [31], where both the skeleton and the numerical values of the constants are simultaneously predicted. Both approaches are further discussed in Section 7.




Figure 12: Structure of a TNN encoder [46]. It comprises an attention layer and a feed-forward neural network.
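The second step of the skeleton approach is an ordinary nonlinear fit. Here is a sketch (our own illustration) for the skeleton of the example above, with the unknown constants fitted by BFGS:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x1, x2 = rng.uniform(-1, 1, (2, 100))
y = np.cos(2 * x1) - 0.1 * np.exp(x2)         # ground-truth constants: 2 and 0.1

def skeleton(c, x1, x2):
    """Predicted skeleton with placeholder constants c0, c1."""
    return np.cos(c[0] * x1) - c[1] * np.exp(x2)

loss = lambda c: np.mean((skeleton(c, x1, x2) - y) ** 2)
res = minimize(loss, x0=[1.0, 1.0], method="BFGS")  # may need restarts in practice
print(np.round(res.x, 3))                           # -> [2.   0.1]
```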


6.3 Reinforcement learning


Reinforcement learning provides a framework for learning and decision-making by trial and error [50]. An RL setting consists of four components $(S, A, P, R)$ in a Markov decision process. In this setting, an agent observes a state $s \in S$ of the environment and, based on that, takes an action $a \in A$, which results in a reward $r = R(s, a)$; the environment then transitions to a new state $s' \in S$. The interaction goes on in time steps until a terminal state is reached. The aim of the agent is to learn the policy $P$ (also called the transition dynamics), which is a mapping from states to actions that maximizes the expected cumulative reward. An exemplary sketch of an RL-based SR method is illustrated in Figure 13.

The SR problem can be framed in RL as follows: the agent (an NN) observes the environment (the parent and sibling nodes in a tree) and, based on the observation, takes an action (predicting the next token of the sequence) and transitions into a new state. In this view, the NN model is like a policy, the parent and sibling are like observations, and the sampled symbols are like actions.





Figure 13: Exemplary sketch of a general RL-based SR method. $s_t$, $a_t$, and $r_t = R(s_t, a_t)$ denote the state, action, and reward at time step $t$; $(t+1)$ denotes the next time step.


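As a crude illustration of this framing (our own toy sketch; actual methods such as DSR [19] use an RNN policy trained with a policy-gradient objective), expressions can be sampled token by token from a categorical policy and scored with a reward equal to the negative MSE:

```python
import numpy as np

TOKENS = ["add", "mul", "sin", "x", "1"]
ARITY = {"add": 2, "mul": 2, "sin": 1, "x": 0, "1": 0}
rng = np.random.default_rng(0)
logits = np.zeros(len(TOKENS))                 # a trivial state-free "policy"

def sample_expression(max_len=12):
    """Sample a prefix sequence until all operator slots are filled."""
    seq, open_slots = [], 1
    while open_slots > 0 and len(seq) < max_len:
        p = np.exp(logits) / np.exp(logits).sum()
        tok = rng.choice(TOKENS, p=p)
        seq.append(tok)
        open_slots += ARITY[tok] - 1
    return seq if open_slots == 0 else None

def evaluate(seq, x):
    """Evaluate a prefix sequence; returns (value, tokens consumed)."""
    tok, used = seq[0], 1
    if tok == "x": return x, 1
    if tok == "1": return np.ones_like(x), 1
    args = []
    for _ in range(ARITY[tok]):
        v, n = evaluate(seq[used:], x)
        args.append(v); used += n
    if tok == "add": return args[0] + args[1], used
    if tok == "mul": return args[0] * args[1], used
    return np.sin(args[0]), used

x = np.linspace(-1, 1, 50)
y = x * np.sin(x)                              # target: mul(x, sin(x))

best, best_r = None, -np.inf
for _ in range(3000):
    seq = sample_expression()
    if seq is None: continue
    r = -np.mean((evaluate(seq, x)[0] - y) ** 2)   # reward = -MSE
    if r > best_r: best, best_r = seq, r
    for tok in seq:                                # crude REINFORCE-like nudge
        logits[TOKENS.index(tok)] += 0.01 * r
print(best, best_r)
```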

7 Applications


Most existing algorithms for solving SR are GP-based, whereas many others, more recent, are deep learning (DL)-based. There exist two different strategies for solving SR problems, as illustrated in the taxonomy of Figure 14.




Figure 14: Strategies for solving the SR problem. An SR algorithm has three types of input: data $(x)$, a new or reduced representation of the data ($x'$ or $z$), or a model $(f(x))$ learned from the data.


The first is a one-step approach, where data points are directly fed into an SR algorithm. The second is a two-step approach involving a process that either learns a new representation of the data or learns a "blackbox" model, which is then fed into the SR algorithm, as described below:

  1. Learn a new representation of the original data set, either by defining new features (reducing the number of independent variables) or by learning a reduced representation using specific NN architectures such as principal component analysis and autoencoders.

  2. Learn a "blackbox" model, either using a regular NN or using a conceptual NN such as a graph neural network (GNN). In this case, an SR algorithm is applied to the learned model or to parts of it.

We group the applications based on the categories presented in Section 3, and we summarize them in Table 5. GP-based applications will not be reviewed here; they are listed in the living review [25], along with DL-based applications. State-of-the-art GP-based methods are discussed in detail in [54]. Among the GP-based applications is the commercial software Eureqa [26], the most well-known GP-based method, which uses the algorithm proposed by Schmidt and Lipson in [2]. Eureqa is used as a baseline SR method in several research works.



Table 5: Summary of symbolic regression applications. In the Strategy column, D denotes data input, 1 denotes data-representation input, and 2 denotes model input to SR.

| Application | Ref | Name/description (code) | Year | Method | Strategy |
|---|---|---|---|---|---|
| SINDY | [51] | Sparse Identification of Nonlinear Dynamics (link) | 2016 | Linear | D |
| SINDY-AE | [33] | Data-driven discovery of coordinates and governing equations (https://github.com/kpchamp/SindyAutoencoders) | 2019 | Linear | 1 |
| EQL | [47] | Equation learner (https://github.com/KristofPusztai/EQL) | 2016 | Nonlinear | D |
| EQL÷ | [48] | Equation learner with division (https://github.com/martius-lab/EQL) | 2018 | Nonlinear | D |
| Eureqa | [26] | Commercial software | 2011 | GP | D |
| FFX | [20] | Fast function extraction (https://github.com/natekupp/ffx/tree/master/ffx) | 2011 | GP | D |
| ITEA | [22] | Interaction-Transformation Evolutionary Algorithm for SR (https://github.com/folivetti/ITEA/) | 2019 | GP | D |
| MRGP | [23] | Multiple regression genetic programming (https://github.com/flexgp/gp-learners) | 2014 | GP | D |
| E2ESR | [31] | End-to-end SR with transformers (https://github.com/facebookresearch/symbolicregression) | 2022 | TNN | D |
| NeSymReS | [32] | Neural SR that scales (https://github.com/SymposiumOrganization/NeuralSymbolicRegressionThatScales) | 2021 | TNN | D |
| DSR | [19] | Deep symbolic regression (https://github.com/brendenpetersen/deep-symbolic-regression) | 2019 | RNN, RL | D |
| NGPPS | [29] | SR via neural-guided GP population seeding (https://github.com/brendenpetersen/deep-symbolic-regression) | 2021 | RNN, GP, RL | D |
| AIFeynman | [27] | Physics-inspired method for SR (https://github.com/SJ001/AI-Feynman) | 2019 | Physics-informed | 1 |
| SM | [30] | Symbolic metamodel (https://bitbucket.org/mvdschaar/mlforhealthlabpub) | 2019 | Mathematics | 2 |
| GNN | [52] | Discovering symbolic models from DL with inductive biases (https://github.com/MilesCranmer/symbolic_deep_learning) | 2020 | GNN | 2 |
| HEAL | [53] | Heuristic and Evolutionary Algorithms Laboratory (https://github.com/heal-research/HeuristicLab) | - | Heuristic | D |



SINDY-AE [33] is a hybrid SR method that combines an autoencoder network [55] with linear SR [51]. The novelty of this approach lies in simultaneously learning a sparse dynamical model and a reduced representation of the coordinates that define the model, using snapshot data. Given a data set $x(t) \in \mathbb{R}^n$, this method seeks to learn a coordinate transformation from the original to intrinsic coordinates $z = \phi(x)$ (encoder) and back via $x = \psi(z)$ (decoder), along with the dynamical model associated with the set of reduced coordinates $z(t) \in \mathbb{R}^d$ ($d \ll n$):

(23) $\frac{d}{dt}z(t) = g(z(t))$

through a customized loss function $\mathcal{L}$, defined as a sum of four terms:

(24) $\mathcal{L} = \underbrace{\|x - \psi(\phi(x))\|_2^2}_{\text{reconstruction error}} + \lambda_1 \underbrace{\|\dot{z} - \dot{z}_{\text{pred}}\|_2^2}_{\text{encoder loss}} + \lambda_2 \underbrace{\|\dot{x} - \dot{x}_{\text{pred}}\|_2^2}_{\text{decoder loss}} + \lambda_3 \underbrace{\|\Theta\|_1}_{\text{regularizer loss}}$

Here the derivatives of the reduced variables $z$ are computed using the derivatives of the original variables $x$, i.e., $\dot{z} = \nabla_x \phi(x)\,\dot{x}$. Predicted coordinates, denoted with the subscript "pred", are NN outputs and are expressed in terms of the coefficient vector $\Theta$ and the library matrix $U(x)$ following Eq. 19, i.e., $\dot{z}_{\text{pred}} = U(z^T)\Theta = U(\phi(x)^T)\Theta$. The library is specified before training, and the coefficients $\Theta$ are learned together with the NN parameters as part of the training procedure.
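For illustration, below is a minimal PyTorch-style sketch of the loss in Eq. 24, assuming a one-dimensional latent space and a small hand-picked library; the network sizes and library terms are assumptions for this sketch, not those of [33].

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jvp

n, d = 64, 1   # assumed input and latent dimensions
enc = nn.Sequential(nn.Linear(n, 32), nn.Tanh(), nn.Linear(32, d))   # z = phi(x)
dec = nn.Sequential(nn.Linear(d, 32), nn.Tanh(), nn.Linear(32, n))   # x = psi(z)

def U(z):  # library matrix U(z); here [1, z, z^2, sin z] as an assumed example
    return torch.cat([torch.ones_like(z), z, z**2, torch.sin(z)], dim=1)

Theta = nn.Parameter(torch.zeros(4, d))  # sparse coefficients, learned with the NN

def sindy_ae_loss(x, x_dot, lam1=1e-4, lam2=1e-5, lam3=1e-5):
    # z_dot = grad_x phi(x) @ x_dot, computed as a Jacobian-vector product
    z, z_dot = jvp(enc, x, x_dot, create_graph=True)
    z_dot_pred = U(z) @ Theta                        # SINDy model g(z) = U(z) Theta
    # x_dot_pred = grad_z psi(z) @ z_dot_pred, again via a JVP
    x_rec, x_dot_pred = jvp(dec, z, z_dot_pred, create_graph=True)
    return (((x - x_rec) ** 2).sum(1).mean()                      # reconstruction error
            + lam1 * ((z_dot - z_dot_pred) ** 2).sum(1).mean()    # encoder loss
            + lam2 * ((x_dot - x_dot_pred) ** 2).sum(1).mean()    # decoder loss
            + lam3 * Theta.abs().sum())                           # L1 regularizer on Theta

x, x_dot = torch.randn(8, n), torch.randn(8, n)
print(sindy_ae_loss(x, x_dot))
```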

A case study is the nonlinear pendulum, whose dynamics are governed by the second-order differential equation $\ddot{x} = -\sin(x)$. The data set is generated as a series of snapshot images from a simulated video of a nonlinear pendulum. After training, the SINDY autoencoder correctly identified the equation $\ddot{z} = -0.99\sin(z)$, which is the dynamical model of a nonlinear pendulum in the reduced representation. This approach is particularly efficient when the dynamical model would be dense in terms of functions of the original measurement coordinates $x$. This method and similar works [56] pave the way to "GoPro physics", where researchers point a camera at an event and get back an equation capturing the underlying phenomenon using an algorithm.

Despite successful applications involving partial differential equations, one main limitation of this method lies in its linear SR part. For example, a model expressed as $f(x) = x_1 x_2 - 2 x_2 \exp(x_3) + 12 \exp(2 x_1 x_3)$ is discovered only if each term of this expression is included in the library, e.g., $\exp(2 x_1 x_3)$. The presence of the exponential function alone, i.e., $\exp(x)$, is insufficient to discover the second and third terms.

Symbolic metamodel [30] (SM) is a model-of-a-model method for interpreting "blackbox" model predictions. It takes a learned "blackbox" model as input and outputs a symbolic expression. Available post-hoc methods aim to explain ML model predictions, i.e., they can explain some aspects of a prediction but cannot offer a full model interpretation. In contrast, SM is interpretable because it uncovers the functional form that underlies the learned model. The symbolic metamodel is based on the Meijer $G$-function [57, 58], a special univariate function characterized by a set of indices, i.e., $G^{m,n}_{p,q}\left(\begin{smallmatrix}a_p\\ b_q\end{smallmatrix}\middle|\,x\right)$, where $a$ and $b$ are two sets of real-valued parameters. An instance of the Meijer $G$-function is specified by $(a, b)$; for example, the function $G^{1,2}_{2,2}\left(\begin{smallmatrix}a,\,b\\ a,\,a\end{smallmatrix}\middle|\,x\right)$ takes different forms for different settings of the parameters $a$ and $b$, as illustrated in Figure 15.

In the context of SR problem solving, the target mathematical expression is defined as a parameterization of the Meijer function, i.e., $\{g(x) = G(\theta, x) \mid \theta = (a, b)\}$, thus reducing the task to a standard parameter-optimization problem that can be efficiently solved using gradient descent, $\theta_{k+1} := \theta_k - \gamma \sum_i \nabla_\theta\, \ell(G(x_i, \theta), f(x_i))\big|_{\theta = \theta_k}$. The parameters $a$ and $b$ are learned during training, and the indices $(m, n, p, q)$ are regarded as hyperparameters of the model. SM was tested on both synthetic and real data and was deployed in two modes, spanning (1) only polynomial expressions (SM$_p$) and (2) closed-form expressions (SM$_c$), in comparison to a GP-based SR method. SM$_p$ produces accurate polynomial expressions for three out of the four tested functions (all except the Bessel function), whereas SM$_c$ produces the correct ground-truth expression for all four functions and significantly outperforms GP-based SR.
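As a toy illustration of this idea, the snippet below fits the two parameters of a $G^{1,2}_{2,2}$ instance to a stand-in "blackbox" by least squares using mpmath's meijerg; the target function, the sampling grid, and the use of derivative-free Nelder-Mead (instead of the gradient descent described above) are simplifying assumptions made here.

```python
import numpy as np
from mpmath import meijerg
from scipy.optimize import minimize

f = lambda x: 1.0 / (1.0 + x)            # stand-in "blackbox" model to metamodel
xs = np.linspace(0.1, 2.0, 25)           # assumed evaluation grid

def G(theta, x):
    a, b = theta
    # G^{1,2}_{2,2}(a,b; a,a | x): the leading sublists carry the n=2 and m=1 parameters
    val = meijerg([[a, b], []], [[a], [a]], x)
    return float(val.real)

def loss(theta):                         # squared error between metamodel and f
    return sum((G(theta, x) - f(x)) ** 2 for x in xs)

res = minimize(loss, x0=[1.0, 1.0], method="Nelder-Mead")
print(res.x, res.fun)                    # fitted (a, b) and residual
```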

More generally, consider a problem in a critical discipline such as healthcare. Assume a feature vector comprising (age, gender, weight, blood pressure, temperature, disease history, profession, etc.), with the aim of predicting the risk of a given disease. Predictions made by a "blackbox" could be highly accurate. However, the learned model does not provide insights into why the risk is high or low for a patient, or which parameter is the most critical or weighty in the prediction. Applying the symbolic metamodel to the learned model outputs a symbolic expression, e.g., $f(x_1, x_2) = x_1(1 - \exp(x_2))$, where $x_1$ is the blood pressure and $x_2$ is the age. Here, we learn that only two features (out of many) are crucial for the prediction and that the risk increases with high blood pressure and decreases with age. This is an ideal example of the difference between "blackbox" and interpretable models. In addition, it is worth mentioning that methods applied for model interpretation only exploit part of the prediction and cannot unveil how the model captures nonlinearities in the data. Model-interpretation methods are therefore insufficient to provide full insight into why and how model predictions are made and are by no means equivalent to interpretable models.





Figure 15: Example of a Meijer $G$-function $G^{1,2}_{2,2}\left(\begin{smallmatrix}a,\,b\\ a,\,a\end{smallmatrix}\middle|\,x\right)$ for different values of $a$ and $b$ [30].


End-to-end symbolic regression [31] (E2ESR) is a transformer-based method that uses end-to-end learning to solve SR problems. It is made up of three components: (1) an embedder that maps each input point $(x_i, y_i)$ to a single embedding, (2) a fully-connected feedforward network, and (3) a transformer that outputs a mathematical expression. What distinguishes E2ESR from other transformer-based applications is the use of an end-to-end approach that does not resort to skeletons, using symbolic representations for the operators and variables and numeric representations for the constants. Both the input data points $\{(x_i, y_i)\}_{i=1}^{n}$ and the mathematical expression $f$ are encoded as sequences of symbolic representations following the description in Section 2.2. E2ESR is tested and compared to several GP-based and DL-based applications on SR benchmarks. Results are reported in terms of mean accuracy, formula complexity, and inference time, and E2ESR was shown to achieve very competitive results, outperforming previous applications.

AIFeynman [27] is a physics-inspired SR method that recursively applies a set of solvers, i.e., dimensional analysis³, polynomial fit, and brute-force search, to solve an SR problem. If the problem is not solved, the algorithm searches for simplifying intrinsic properties in the data (e.g., invariance, factorization) using an NN and deploys them to recursively decompose the dataset into simpler sub-problems with fewer independent variables. Each sub-problem is then tackled by a symbolic regression method of choice. The authors created the Feynman SR database (see Section 8) to test their approach. All of the basic equations and 90% of the bonus equations were solved by their algorithm, outperforming Eureqa.

Deep symbolic regression (DSR) [19] is an RL-based search method for symbolic regression that uses a generative recurrent neural network (RNN). The RNN defines a probability distribution $p(\tau|\theta)$ over mathematical expressions $\tau$, and batches of expressions $\mathcal{T} = \{\tau^{(i)}\}_{i=1}^{N}$ are stochastically generated. An exemplary sketch of how the RNN generates an expression (e.g., $x^2\cos(x)$) is shown in Figure 16. Starting with the first node in the pre-order traversal (Section 2.2) of an expression tree, the RNN is initially fed with empty placeholder tokens (a parent and a sibling) and produces a categorical distribution, i.e., it outputs the probability of selecting each token from the defined library $\mathcal{L} = \{+, -, \times, \div, \sin, \cos, \log, \text{etc.}\}$. The sampled token is placed at the first node, and the number of child nodes is determined by whether the operation is unary (one child) or binary (two children). The next node is then selected, and the RNN, carrying its internal state along with the first token, outputs a new (and potentially different) categorical distribution. This procedure is repeated until the expression is complete. Expressions are then evaluated with a reward function $R(\tau)$ that tests the goodness of fit of each candidate expression $f$ to the data $\mathcal{D}$ using the normalized root-mean-square error,

$$R(\tau) = \frac{1}{1 + \frac{1}{\sigma_y}\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(X_i)\right)^2}}.$$
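In code, this reward is a one-line transcription of the equation above (a small NumPy sketch, with names chosen here for illustration):

```python
import numpy as np

def reward(y, y_pred):
    """R(tau) = 1 / (1 + NRMSE), with the RMSE normalized by the std of y."""
    nrmse = np.sqrt(np.mean((y - y_pred) ** 2)) / np.std(y)
    return 1.0 / (1.0 + nrmse)
```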




³ Dimensional analysis is a well-known technique in physics that uses units of measurement to solve an equation and/or to check the correctness of a given equation.








Figure 16: Exemplary sketch of an RNN generating the mathematical expression $x^2\cos(x)$.


To generate better expressions, the probability distribution $p(\tau|\theta)$ needs to be optimized. Using a gradient-based approach would require the reward function $R(\tau)$ to be differentiable with respect to the RNN parameters $\theta$, which it is not. Instead, the learning objective is defined as the expectation of the reward under expressions sampled from the policy, i.e., $J(\theta) = \mathbb{E}_{\tau \sim p(\tau|\theta)}[R(\tau)]$, and reinforcement learning is used to maximize $J(\theta)$ by means of the "standard policy gradient":

(25) $\nabla_\theta J(\theta) = \nabla_\theta\, \mathbb{E}_{\tau \sim p(\tau|\theta)}[R(\tau)] = \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[R(\tau)\, \nabla_\theta \log p(\tau|\theta)\right]$

This reinforcement learning trick, called REINFORCE [59], can be derived using the definition of the expectation $\mathbb{E}[\cdot]$ and the derivative of the $\log(\cdot)$ function as follows:

(26)
$$\begin{aligned}
\nabla_\theta\, \mathbb{E}_{\tau \sim p(\tau|\theta)}[R(\tau)] &= \nabla_\theta \int R(\tau)\, p(\tau|\theta)\, d\tau \\
&= \int R(\tau)\, \nabla_\theta\, p(\tau|\theta)\, d\tau \\
&= \int R(\tau)\, \frac{\nabla_\theta\, p(\tau|\theta)}{p(\tau|\theta)}\, p(\tau|\theta)\, d\tau \\
&= \int R(\tau)\, \nabla_\theta \log p(\tau|\theta)\, p(\tau|\theta)\, d\tau \\
&= \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[R(\tau)\, \nabla_\theta \log p(\tau|\theta)\right]
\end{aligned}$$

The importance of this result is that it allows the expectation to be estimated using samples from the distribution. More explicitly, the gradient of $J(\theta)$ is estimated by computing the mean over a batch of $N$ sampled expressions as follows:

(27) $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} R(\tau^{(i)})\, \nabla_\theta \log p(\tau^{(i)}|\theta)$
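In an autodiff framework, Eq. 27 is typically implemented through a surrogate loss whose gradient equals the estimator; a minimal PyTorch-style sketch (the variable names are illustrative):

```python
import torch

def reinforce_loss(log_probs, rewards):
    # log_probs: (N,) summed token log-probabilities log p(tau_i | theta)
    # rewards:   (N,) rewards R(tau_i), treated as constants w.r.t. theta
    return -(rewards.detach() * log_probs).mean()

# Calling .backward() on this loss accumulates the negative of Eq. 27 into the
# policy parameters' gradients, so a gradient-descent step ascends J(theta).
```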

The standard policy gradient (Eq. 25) optimizes a policy's average performance over all samples from the distribution. Since SR requires maximizing best-case performance, i.e., optimizing the gradient over the top $\epsilon$ fraction of samples found during training, a new learning objective is defined as the conditional expectation of rewards above the $(1-\epsilon)$-quantile of the reward distribution, as follows:

(28) $J_{\text{risk}}(\theta, \epsilon) = \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[R(\tau) \mid R(\tau) \geq R_\epsilon(\theta)\right]$


where $R_\epsilon(\theta)$ denotes the $(1-\epsilon)$-quantile of the reward distribution under the current policy. The gradient of the new learning objective is given by:

(29) $\nabla_\theta J_{\text{risk}}(\theta, \epsilon) = \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[\left(R(\tau) - R_\epsilon(\theta)\right)\nabla_\theta \log p(\tau|\theta) \mid R(\tau) \geq R_\epsilon(\theta)\right]$
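The risk-seeking gradient of Eq. 29 differs from Eq. 27 only in filtering the batch by the empirical $(1-\epsilon)$-quantile and using it as a baseline; continuing the sketch above:

```python
import torch

def risk_seeking_loss(log_probs, rewards, eps=0.05):
    r_eps = torch.quantile(rewards, 1.0 - eps)    # empirical R_eps(theta)
    keep = rewards >= r_eps                       # top-eps fraction of the batch
    advantage = (rewards[keep] - r_eps).detach()  # baseline-subtracted reward
    return -(advantage * log_probs[keep]).mean()
```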

DSR was evaluated primarily on the Nguyen SR benchmark and several additional variants of it. An excellent recovery rate was reported for each set, and DSR recovered all benchmark expressions except Nguyen-12, given by $x^4 - x^3 + \frac{1}{2}y^2 - y$. More details on SR data benchmarks can be found in Section 8.

Neural-guided genetic programming population seeding [29] (NGPPS) is a hybrid method that combines GP and the RNN of [19], leveraging the strengths of each of the two components. Whereas GP normally begins with a random starting population, the authors of [29] propose to use the batch of expressions sampled by the RNN as the starting population for GP: $\mathcal{T}_{GP}^{(0)} = \mathcal{T}_{RNN}$. Each iteration of the proposed algorithm consists of four steps: (1) the batch of expressions sampled by the RNN is passed as a starting population to GP, (2) $S$ generations of GP are performed, resulting in a final GP population $\mathcal{T}_{GP}^{S}$, (3) an elite set of top-performing GP samples $\mathcal{T}_{GP}^{E}$ is selected, and (4) the elite set is used in the gradient update of the RNN.




Figure 17: Neural-guided genetic programming population seeding method overview [29].


Neural symbolic regression that scales [32] (NeSymReS) is a transformer-based algorithm that emphasizes large-scale pre-training. It comprises a pre-training phase and a test phase. Pre-training includes data generation and model training, with hundreds of millions of training examples generated and fresh minibatches sampled during pre-training. Each training example consists of a symbolic equation $f_e$ and a set of $n$ input-output pairs $\{x_i, y_i = f_e(x_i)\}$, where $n$ can vary across examples and the number of independent input variables is at most three. In the test phase, a set of input-output pairs $\{x_i, y_i\}$ is fed into the encoder, which maps it into a latent vector $z$, and the decoder iteratively samples candidate skeletons. What distinguishes this method is the learning task: it improves over time with experience and does not need to be retrained from scratch for each new experiment. NeSymReS was shown to outperform selected baselines (including DSR) in time and accuracy by a large margin on all tested datasets (AI-Feynman, Nguyen, and strictly out-of-sample equations (SOOSE), with and without constants). NeSymReS is more than three orders of magnitude faster at reaching the same maximum accuracy as GP while running only on a CPU.

GNN [52] is a hybrid scheme that performs SR by training a graph neural network (GNN) and applying SR algorithms to the GNN's components to find mathematical equations.

A case study is Newtonian dynamics, which describes the dynamics of particles in a system according to Newton's laws of motion. $\mathcal{D}$ consists of an $N$-body system with a known interaction (a force law $F$ such as electric, gravitational, spring, etc.), where particles (nodes) are characterized by their attributes (mass, charge, position, velocity, and acceleration) and their interactions (edges) are assigned attributes of dimension 100. The GNN functions are trained to predict the instantaneous acceleration of each particle using the simulated data and are then applied to a different data sample. The study shows that the most significant edge attributes, say $\{e_1, e_2\}$, fit a linear combination of the true force components $\{F_1, F_2\}$ used in the simulation, showing that the edge attributes can be interpreted as force laws. The most significant edge attributes were then passed to Eureqa to uncover analytical expressions equivalent to the simulated force laws. The proposed approach was also applied to datasets in the field of cosmology, where it discovered an equation that fits the data better than the existing hand-designed equation.


The same group has recently succeeded in inferring Newton's law of gravitation using a GNN together with PySR for the symbolic regression task [34]. The GNN was trained using observed trajectories (positions) of the Sun, planets, and moons of the solar system collected over 30 years. The SR algorithm correctly inferred Newton's formula for the interaction between masses, i.e., $F = G M_1 M_2 / r^2$, as well as the masses and the gravitational constant.

8 Datasets


For symbolic regression purposes, there exist several benchmark data sets, which can be categorized into two main groups: (1) ground-truth problems (synthetic data) and (2) real-world problems (real data), as summarized in Figure 18. In this section, we describe each category and discuss its main strengths and limitations.




Figure 18: Taxonomy based on the type of SR benchmark problems.


Ground-truth regression problems are characterized by known mathematical equations; they are listed in Table 6. These include (1) physics-inspired equations [27, 60] and (2) real-valued symbolic equations [1, 14, 15, 16, 17, 18, 19, 61].


Table 6: Summary of ground-truth problems for symbolic regression.

| Type | Benchmark | Number of problems | Year | Reference |
|---|---|---|---|---|
| Physics-related | Feynman Database | 119 | 2019 | [27] |
| Physics-related | Strogatz Repository | 10 | 2011 | [60] |
| Mathematics-related | Koza | 3 | 1994 | [1] |
| Mathematics-related | Keijzer | 15 | 2003 | [14] |
| Mathematics-related | Vladislavleva | 8 | 2009 | [15] |
| Mathematics-related | Nguyen | 12 | 2011 | [17] |
| Mathematics-related | Korns | 15 | 2011 | [16] |
| Mathematics-related | R | 3 | 2013 | [61] |
| Mathematics-related | Jin | 6 | 2019 | [18] |
| Mathematics-related | Livermore | 22 | 2021 | [19] |


The Feynman Symbolic Regression Database [62] is the largest SR database; it originates from the Feynman Lectures on Physics series [63, 64] and was proposed in [27]. It consists of 119⁴ physics-inspired equations that describe static physical systems and various physics processes. The proposed equations depend on at least one and at most nine variables. Each benchmark (corresponding to one equation) is generated by randomly sampling one million entries. Each entry is a row of randomly generated input variables, sampled uniformly between 1 and 5. This sampling range was slightly adjusted for some equations to avoid unphysical results (e.g., division by zero or the square root of a negative number). The output is evaluated using the function $f$, i.e., $\mathcal{D} = \{x_i \in \mathbb{R}^d,\; y_i = f(x_1, \ldots, x_d)\}$.




⁴ The equation number II.11.17 is missing in the benchmark repository.
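As a sketch of this generation procedure, one benchmark table can be produced by uniform sampling and evaluation of the ground-truth formula; the gravitation-like formula below is chosen purely as an example, and it also previews the sampling ranges discussed next:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000                                 # one million entries per benchmark
m1, m2, r = (rng.uniform(1, 5, N) for _ in range(3))
G_const = rng.uniform(1, 2, N)                # constants are sampled too (a noted limitation)
y = G_const * m1 * m2 / r**2                  # evaluate the ground-truth f
table = np.column_stack([G_const, m1, m2, r, y])  # rows: inputs followed by the output
```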





This benchmark is rich in proposing various theoretical formulae. Still, it suffers a few limitations: (1) there is no distinction between variables and constants, i.e., constants are randomly sampled and, in some cases, in domains extremely far from physical values. For example, the speed of light is sampled from a uniform distribution U(1,20) whereas its physical value is orders of magnitude higher,i.e., c=2.988×108m/s ,and the gravitational constant is sampled from U(1,2) whereas its physical value is orders of magnitude smaller, G=6.6743×1011m3kg1s2 ,among others (e.g.,vacuum permittivity ϵ1012 ,Boltzmann constant kb1023 ,Planck constant h1034 ). (2) Some variables are sampled in nonphysical ranges. For example, the gravitational force is defined between two masses distant by r as F=Gm1m2/r2 . This force is weak unless defined between significantly massive objects (e.g.,the mass of the earth is Me=5.9722×1024kg ) whereas m1 and m2 are sampled in U(1,5) in the Feynman database. (3) Some variables are treated as floats while they are integers, and (4) many equations are duplicates of each other (e.g., a multiplicative function of two variables f(x,y)=xy) or have similar functional forms.
该基准测试集在提出各类理论公式方面内容丰富,但仍存在若干局限性:(1) 未区分变量与常量,即常量被随机采样,某些情况下其取值域与物理实际值相差甚远。例如,光速从均匀分布 U(1,20) 中采样,而其物理值高出数个数量级,即 c=2.988×108m/s ;引力常数采样自 U(1,2) ,而其物理值则小数个数量级,为 G=6.6743×1011m3kg1s2 ,其他例子还包括真空介电常数 ϵ1012 、玻尔兹曼常数 kb1023 、普朗克常数 h1034 等。(2) 部分变量采样范围不符合物理实际。例如,两质量间距离 r 时的引力定义为 F=Gm1m2/r2 。此力除非作用于极大质量物体(如地球质量为 Me=5.9722×1024kg )否则较弱,而 Feynman 数据库中 m1m2 的采样范围却是 U(1,5) 。(3) 某些本应为整数的变量被当作浮点数处理;(4) 许多方程存在重复,例如两变量的乘积函数 f(x,y)=xy) 或具有相似函数形式。

The ODE-Strogatz repository [60] consists of ten physics equations that describe the behavior of dynamical systems which can exhibit chaotic and/or non-linear behavior. Each dataset is one state of a two-state system of ordinary differential equations.
ODE-Strogatz 代码库[60]包含十个描述动力系统行为的物理方程,这些系统可能表现出混沌和/或非线性行为。每个数据集均为一个由常微分方程构成的双态系统的一个状态。

Within the same category, there exist several benchmarks [1, 14, 15, 16, 17, 18, 19] consisting of real-valued symbolic functions. The majority of these benchmarks are proposed for GP-based methods and grouped into four categories: polynomial, trigonometric, logarithmic, exponential, and square-root functions, and a combination of univariate and bivariate functions. The suggested functions do not have any physical meaning, and most depend either on one or two independent variables. Datasets are generally generated by randomly sampling either 20 or 100 points in narrow ranges. The most commonly known is the so-called Nguyen benchmark, which consists of 12 symbolic functions taken from [65, 66, 67]. Only four equations have the scalars {1,2,1/2} as constants therein. Each benchmark is defined by a ground-truth expression, training, and test datasets. The equations proposed in these benchmarks can not be found in a single repository. Therefore we list them in the appendix in Tables 7.9 and Tables 10-11 for completeness and for easy comparison.
在同一类别中,存在多个由实值符号函数组成的基准测试集[1, 14, 15, 16, 17, 18, 19]。这些基准大多针对基于 GP 的方法提出,并分为四类:多项式、三角函数、对数函数、指数函数与平方根函数,以及单变量与双变量函数的组合。所建议的函数不具备任何物理意义,且多数仅依赖于一个或两个独立变量。数据集通常通过在狭窄范围内随机采样 20 或 100 个点生成。其中最广为人知的是所谓的 Nguyen 基准,它包含取自[65, 66, 67]的 12 个符号函数,其中仅四个方程含有标量 {1,2,1/2} 作为常量。每个基准均由真实表达式、训练集和测试集定义。这些基准中提出的方程无法在单一存储库中找到,因此为完整性和便于比较,我们在附录的表 7.9 及表 10-11 中列出了它们。

Real-world problems are characterized by an unknown model that underlies data. This category comprises two groups: observations and measurements. Data sets in the observations category can originate from any domain, such as health informatics, environmental science, business, commerce, etc. Data could be collected online or offline from reports or studies. A wide range of problems can be assessed from the following repositories: the PMLB [68], the OpenML [69], and the UCI [70]. An exemplary application in this category is wind speed forecasting [71]. Measurements represent sets of data points that are collected (and sometimes analyzed) in physics experiments. Here the target model is either an underlying theory than can be derived from first principles or not. In the first case, symbolic regression would either infer the correct model structure and parameters or contribute to the theory development of the studied process, whereas in the second case, the symbolic regression output could be the awaited theory.
现实世界的问题往往由数据背后未知的模型所定义。此类问题可分为两组:观测数据与测量数据。观测类数据集可源自任何领域,如健康信息学、环境科学、商业贸易等。这些数据可能通过线上或线下方式从报告或研究中收集。通过以下资源库可评估各类问题:PMLB [68]、OpenML [69] 和 UCI [70]。此类应用的典型范例是风速预测[71]。测量数据则代表物理学实验中收集(有时会分析)的数据点集合。此时目标模型要么是基于第一性原理可推导的理论,要么尚未明确。前者情况下,符号回归既可推断出正确模型结构与参数,又能推动所研究过程的理论发展;后者情况下,符号回归的输出可能成为期待中的理论。

9 Discussion


SR is a growing area of ML and is gaining attention as interpretability is increasingly promoted in AI applications [3]. SR is propelled by the fact that ML models are growing enormously in parameter count in pursuit of accurate predictions. An exemplary case is ChatGPT-4, a large language model comprising hundreds of billions of parameters and trained on hundreds of terabytes of textual data. Such big models are very complicated networks. ChatGPT-4, for example, accomplishes increasingly complicated and intelligent tasks, to the point of showing emergent properties [72]. However, it is not straightforward to understand when it works and, more importantly, when it does not. In addition, its performance improves with the number of parameters, highlighting that its prediction accuracy depends on the size of the training data set. Therefore, a new paradigm is needed, especially in scientific disciplines, such as the physical sciences, where problems are of a causal, hypothesis-driven nature. SR is by far the strongest candidate to fulfill the interpretability requirements and is expected to play a central role in the future of ML.

Despite the significant advances made in this subfield and the high performance of most deep learning-based SR methods proposed in the literature, SR methods still fail to recover relatively simple relationships. A case in point is the Nguyen-12 expression, $f(x, y) = x^4 - x^3 + y^2/2 - y$, where $x$ and $y$ are uniformly sampled in the range $[0, 1]$. The NGPPS method could not recover this particular expression using the library basis $\mathcal{L} = \{+, -, \times, \div, \sin, \cos, \exp, \log, x, y\}$. A variant of this expression, Nguyen-12*, consisting of the same equation but defined over a larger domain, i.e., data points sampled in $[0, 10]$, was successfully recovered using the same library, with a recovery rate of 12%. This result is significantly below the perfect performance on all other Nguyen expressions. A similar observation is made for Livermore-5, whose expression is $f(x, y) = x^4 - x^3 + x^2 - y$.


We ran NGPPS on Nguyen-12 with two libraries, a pure polynomial basis $\mathcal{L}_1 = \{+, -, \times, \div, (\cdot)^2, (\cdot)^3, (\cdot)^4, x, y\}$ and a mixed basis $\mathcal{L}_2 = \mathcal{L}_1 \cup \{\sin, \cos, \exp, \log, \mathrm{sqrt}, \mathrm{expneg}\}$. The algorithm succeeds in recovering Nguyen-12 only with the pure polynomial basis, at a recovery rate of 3%. The same observation is made when applying linear SR to Nguyen-12. This highlights how strongly the predicted expression depends on the set of allowable mathematical operations. A practical way to counter this limitation is to incorporate basic domain knowledge into SR applications whenever possible. For example, astronomical data collected by detecting the light curves of astronomical objects exhibit periodic behavior; in such cases, periodic functions such as the trigonometric functions should be part of the library basis.

Most SR methods are applied only to synthetic data for which the input-output relationship is known. This is justified because the methods must be cross-checked and their performance evaluated against ground-truth expressions. However, the reported results then hold for synthetic data only. To the best of our knowledge, only one physics application [34] has succeeded in extracting Newton's law of gravitation by applying SR to astronomical data. The scarcity of such applications leads us to conclude that SR is still a relatively nascent area with the potential to make a big impact. Physics in general, and the physical sciences in particular, represent a very broad field for SR development and are very rich in both data and expressions; areas such as astronomy and high-energy physics, for example, are very rich in data. In addition, much of our acquired knowledge in physics can be used to test SR methods because the underlying phenomena and equations are well known. All that is needed is greater effort and investment.

10 Conclusion


This work presents an in-depth introduction to the symbolic regression problem and an expansive review of its methodologies and state-of-the-art applications. It also highlights a number of conclusions that can be drawn about symbolic regression methods: (1) linear symbolic regression suffers from many limitations, all originating from predefining the model structure; (2) neural network-based methods lead to numerical issues, and the library cannot include all mathematical operations; (3) expression tree-based methods, in particular transformer-based ones, are as yet the most powerful in terms of model performance on synthetic data; (4) model predictions strongly depend on the set of allowable operations in the library basis; and (5) in general, deep learning-based methods perform better than other ML-based methods.

Symbolic regression represents a powerful tool for learning interpretable models in a data-driven manner. Its application is likely to grow in the future because it balances prediction accuracy and interpretability. Despite the limited application of SR to real data, the few existing examples are very promising. A potential path to boosting progress in this subfield is to apply symbolic regression to experimental data in physics.

References


[1] J. R. Koza, "Genetic programming as a means for programming computers by natural selection," Statistics and Computing, vol. 4, no. 2, pp. 87-112, 1994.

[2] M. Schmidt and H. Lipson, "Distilling free-form natural laws from experimental data," Science, vol. 324, no. 5923, pp. 81-85, 2009.

[3] C. Rudin, "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead," Nature Machine Intelligence, vol. 1, no. 5, pp. 206-215, 2019.

[4] M. Mozaffari-Kermani, S. Sur-Kolay, A. Raghunathan, and N. K. Jha, "Systematic poisoning attacks on and defenses for machine learning in healthcare," IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 6, pp. 1893-1905, 2015.

[5] J. Kepler, Epitome astronomiae Copernicanae. In: Noscemus Wiki.

[6] I. Newton, A. Motte, and J. Machin, The Mathematical Principles of Natural Philosophy, Volume 1. London: B. Motte, 1729.

[7] D. Gerwin, "Information processing, data inferences, and scientific generalization," Systems Research and Behavioral Science, vol. 19, pp. 314-325, 1974.

[8] P. Langley, "Data-driven discovery of physical laws," Cognitive Science, vol. 5, no. 1, pp. 31-54, 1981.

[9] B. C. Falkenhainer and R. S. Michalski, "Integrating quantitative and qualitative discovery: The ABACUS system," Machine Learning, vol. 1, pp. 367-401, Mar. 1986.

[10] P. Langley, "BACON: A production system that discovers empirical laws," 1979.

[11] P. Langley, H. A. Simon, G. L. Bradshaw, and J. M. Zytkow, Scientific Discovery: Computational Explorations of the Creative Process. Cambridge, MA, USA: MIT Press, 1987.

[12] J. R. Koza, "Hierarchical genetic algorithms operating on populations of computer programs," in Proceedings of the 11th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI'89, (San Francisco, CA, USA), pp. 768-774, Morgan Kaufmann Publishers Inc., 1989.

[13] J. R. Koza, "Genetic programming: A paradigm for genetically breeding populations of computer programs to solve problems," tech. rep., Stanford, CA, USA, 1990.

[14] M. Keijzer, "Improving symbolic regression with interval arithmetic and linear scaling," in Genetic Programming (C. Ryan, T. Soule, M. Keijzer, E. Tsang, R. Poli, and E. Costa, eds.), (Berlin, Heidelberg), pp. 70-82, Springer Berlin Heidelberg, 2003.

[15] E. Vladislavleva, G. Smits, and D. den Hertog, "Order of nonlinearity as a complexity measure for models generated by symbolic regression via pareto genetic programming," IEEE Transactions on Evolutionary Computation, vol. 13, pp. 333-349, 2009.

[16] M. F. Korns, Accuracy in Symbolic Regression, pp. 129-151. New York, NY: Springer New York, 2011.

[17] N. Q. Uy, N. X. Hoai, M. O'Neill, R. I. McKay, and E. G. López, "Semantically-based crossover in genetic programming: application to real-valued symbolic regression," Genetic Programming and Evolvable Machines, vol. 12, pp. 91-119, 2010.

[18] Y. Jin, W. Fu, J. Kang, J. Guo, and J. Guo, "Bayesian symbolic regression," 2019.

[19] B. K. Petersen, "Deep symbolic regression: Recovering mathematical expressions from data via policy gradients," CoRR, vol. abs/1912.04871, 2019.

[20] T. McConaghy, FFX: Fast, Scalable, Deterministic Symbolic Regression Technology, pp. 235-260. New York, NY: Springer New York, 2011.

[21] M. Virgolin, T. Alderliesten, C. Witteveen, and P. A. N. Bosman, "A model-based genetic programming approach for symbolic regression of small expressions," CoRR, vol. abs/1904.02050, 2019.

[22] F. O. de França and G. S. I. Aldeia, "Interaction-transformation evolutionary algorithm for symbolic regression," CoRR, vol. abs/1902.03983, 2019.

[23] I. Arnaldo, K. Krawiec, and U.-M. O'Reilly, "Multiple regression genetic programming," in Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO '14, (New York, NY, USA), pp. 879-886, Association for Computing Machinery, 2014.

[24] W. G. L. Cava, T. R. Singh, J. Taggart, S. Suri, and J. H. Moore, "Stochastic optimization approaches to learning concise representations," CoRR, vol. abs/1807.00981, 2018.

[25] N. Makke and S. Chawla, "A living review of symbolic regression," https://github.com/nmakke/SR-LivingReview, 2022.

[26] R. Dubcakova, "Eureqa: software review," Genetic Programming and Evolvable Machines, vol. 12, pp. 173-178, June 2011.

[27] S.-M. Udrescu and M. Tegmark, "AI Feynman: a physics-inspired method for symbolic regression," 2019.

[28] G. Martius and C. H. Lampert, "Extrapolation and learning equations," CoRR, vol. abs/1610.02995, 2016.

[29] T. N. Mundhenk, M. Landajuela, R. Glatt, C. P. Santiago, D. M. Faissol, and B. K. Petersen, "Symbolic regression via neural-guided genetic programming population seeding," CoRR, vol. abs/2111.00053, 2021.

[30] A. M. Alaa and M. van der Schaar, "Demystifying black-box models with symbolic metamodels," in Advances in Neural Information Processing Systems (H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds.), vol. 32, (New York), Curran Associates, Inc., 2019.

[31] P.-A. Kamienny, S. d'Ascoli, G. Lample, and F. Charton, "End-to-end symbolic regression with transformers," 2022.

[32] L. Biggio, T. Bendinelli, A. Neitz, A. Lucchi, and G. Parascandolo, "Neural symbolic regression that scales," CoRR, vol. abs/2106.06427, 2021.

[33] K. Champion, B. Lusch, J. N. Kutz, and S. L. Brunton, "Data-driven discovery of coordinates and governing equations," Proceedings of the National Academy of Sciences, vol. 116, no. 45, pp. 22445-22451, 2019.

[34] P. Lemos, N. Jeffrey, M. Cranmer, S. Ho, and P. Battaglia, "Rediscovering orbital mechanics with machine learning," 2022.

[35] R. Batra, L. Song, and R. Ramprasad, "Emerging materials intelligence ecosystems propelled by machine learning," Nature Reviews Materials, vol. 6, pp. 655-678, Nov. 2020.

[36] A. Hernandez, A. Balasubramanian, F. Yuan, S. Mason, and T. Mueller, "Fast, accurate, and transferable many-body interatomic potentials by symbolic regression," 2019.

[37] Y. Wang, N. Wagner, and J. M. Rondinelli, "Symbolic regression in materials science," MRS Communications, vol. 9, pp. 793-805, Sep. 2019.

[38] B. Weng, Z. Song, R. Zhu, Q. Yan, Q. Sun, C. G. Grice, Y. Yan, and W.-J. Yin, "Simple descriptor derived from symbolic regression accelerating the discovery of new perovskite catalysts," Nature Communications, vol. 11, 2020.

[39] J. Martinez-Gil and J. M. Chaves-Gonzalez, "A novel method based on symbolic regression for interpretable semantic similarity measurement," Expert Systems with Applications, vol. 160, p. 113663, 2020.

[40] I. A. Abdellaoui and S. Mehrkanoon, "Symbolic regression for scientific discovery: an application to wind speed forecasting," 2021 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 01-08, 2021.

[41] M. Virgolin, Z. Wang, T. Alderliesten, and P. A. N. Bosman, "Machine learning for the prediction of pseudorealistic pediatric abdominal phantoms for radiation dose reconstruction," Journal of Medical Imaging, vol. 7, no. 4, p. 046501, 2020.

[42] W. G. L. Cava, P. Orzechowski, B. Burlacu, F. O. de França, M. Virgolin, Y. Jin, M. Kommenda, and J. H. Moore, "Contemporary symbolic regression methods and their relative performance," CoRR, vol. abs/2107.14351, 2021.

[43] V. Vapnik, "Principles of risk minimization for learning theory," in Advances in Neural Information Processing Systems (J. Moody, S. Hanson, and R. Lippmann, eds.), vol. 4, (Cambridge, Massachusetts), Morgan-Kaufmann, 1991.

[44] M. Virgolin and S. P. Pissis, "Symbolic regression is NP-hard," 2022.

[45] R. Robinson, "Jan Lukasiewicz: Aristotle's syllogistic from the standpoint of modern formal logic. Second edition, enlarged. Pp. xvi + 222. Oxford: Clarendon Press, 1957," The Classical Review, vol. 8, no. 3-4, pp. 282-282, 1958.

[46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," CoRR, vol. abs/1706.03762, 2017.

[47] G. Martius and C. H. Lampert, "Extrapolation and learning equations," CoRR, vol. abs/1610.02995, 2016.

[48] S. S. Sahoo, C. H. Lampert, and G. Martius, "Learning equations for extrapolation and control," CoRR, vol. abs/1806.07259, 2018.

[49] M. Valipour, B. You, M. Panju, and A. Ghodsi, "SymbolicGPT: A generative transformer model for symbolic regression," CoRR, vol. abs/2106.14131, 2021.

[50] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book, 2018.

[51] S. L. Brunton, J. L. Proctor, and J. N. Kutz, "Discovering governing equations from data by sparse identification of nonlinear dynamical systems," Proceedings of the National Academy of Sciences, vol. 113, no. 15, pp. 3932-3937, 2016.

[52] M. D. Cranmer, A. Sanchez-Gonzalez, P. W. Battaglia, R. Xu, K. Cranmer, D. N. Spergel, and S. Ho, "Discovering symbolic models from deep learning with inductive biases," CoRR, vol. abs/2006.11287, 2020.

[53] Heuristic and Evolutionary Algorithms Laboratory (HeuristicLab), https://github.com/heal-research/HeuristicLab.

[54] W. La Cava, K. Danai, and L. Spector, "Inference of compact nonlinear dynamic models by epigenetic local search," Engineering Applications of Artificial Intelligence, vol. 55, pp. 292-306, 2016.

[55] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning Internal Representations by Error Propagation, pp. 318-362. Cambridge, MA, USA: MIT Press, 1986.

[56] B. Chen, K. Huang, S. Raghupathi, I. Chandratreya, Q. Du, and H. Lipson, "Discovering state variables hidden in experimental data," 2021.

[57] C. Meijer, On the G-function. Netherlands: North-Holland, 1946.

[58] R. Beals and J. Szmigielski, "Meijer G-functions: a gentle introduction," Notices of the American Mathematical Society, vol. 60, pp. 866-873, 2013.

[59] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, pp. 229-256, May 1992.

[60] W. La Cava, K. Danai, and L. Spector, "Inference of compact nonlinear dynamic models by epigenetic local search," Engineering Applications of Artificial Intelligence, vol. 55, pp. 292-306, 2016.

[61] K. Krawiec and T. Pawlak, "Approximating geometric crossover by semantic backpropagation," in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, GECCO '13, (New York, NY, USA), pp. 941-948, Association for Computing Machinery, 2013.

[62] M. Tegmark, "The Feynman symbolic regression database," May 2019.

[63] R. Feynman, R. Leighton, and M. Sands, The Feynman Lectures on Physics, Vol. I: The New Millennium Edition: Mainly Mechanics, Radiation, and Heat. New York: Basic Books, 2011.

[64] R. Feynman, R. Leighton, M. Sands, and M. Gottlieb, The Feynman Lectures on Physics, Volume 2. Boston: Pearson/Addison-Wesley, 2006.

[65] M. Keijzer, "Improving symbolic regression with interval arithmetic and linear scaling," in Genetic Programming (C. Ryan, T. Soule, M. Keijzer, E. Tsang, R. Poli, and E. Costa, eds.), (Berlin, Heidelberg), pp. 70-82, Springer Berlin Heidelberg, 2003.

[66] N. Hoai, R. McKay, D. Essam, and R. Chau, "Solving the symbolic regression problem with tree-adjunct grammar guided genetic programming: the comparative results," in Proceedings of the 2002 Congress on Evolutionary Computation, CEC'02, vol. 2, pp. 1326-1331, 2002.

[67] C. G. Johnson, "Genetic programming crossover: Does it cross over?," in Genetic Programming (L. Vanneschi, S. Gustafson, A. Moraglio, I. De Falco, and M. Ebner, eds.), (Berlin, Heidelberg), pp. 97-108, Springer Berlin Heidelberg, 2009.

[68] R. S. Olson, W. La Cava, P. Orzechowski, R. J. Urbanowicz, and J. H. Moore, "PMLB: a large benchmark suite for machine learning evaluation and comparison," BioData Mining, vol. 10, pp. 1-13, Dec. 2017.

[69] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo, "OpenML: networked science in machine learning," SIGKDD Explorations, vol. 15, no. 2, pp. 49-60, 2013.

[70] D. Dua and C. Graff, "UCI machine learning repository," 2017.

[71] I. A. Abdellaoui and S. Mehrkanoon, "Symbolic regression for scientific discovery: an application to wind speed forecasting," 2021.

[72] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, "Emergent abilities of large language models," 2022.

[73] U.-M. O'Reilly, "Genetic programming II: Automatic discovery of reusable programs," Artificial Life, vol. 1, no. 4, pp. 439-441, 1994.


A Datasets Benchmarks Equations



Table 7: Ground-truth expressions for Koza [73, Nguyen [17], Jin [18], Keijzer [14] and R 61 benchmarks.
表 7:Koza [73]、Nguyen [17]、Jin [18]、Keijzer [14]及 R 61 基准的真实表达式。

Dataset | Expression | Variables | Data range
Koza-1 | $x^4+x^3+x^2+x$ | 1 | U[-1,1,20]
Koza-2 | $x^5-2x^3+x$ | 1 | U[-1,1,20]
Koza-3 | $x^6-2x^4+x^2$ | 1 | U[-1,1,20]
Nguyen-1 | $x^3+x^2+x$ | 1 | U(-1,1,20)
Nguyen-2 | $x^4+x^3+x^2+x$ | 1 | U(-1,1,20)
Nguyen-3 | $x^5+x^4+x^3+x^2+x$ | 1 | U(-1,1,20)
Nguyen-4 | $x^6+x^5+x^4+x^3+x^2+x$ | 1 | U(-1,1,20)
Nguyen-5 | $\sin(x^2)\cos(x)-1$ | 1 | U(-1,1,20)
Nguyen-6 | $\sin(x)+\sin(x+x^2)$ | 1 | U(-1,1,20)
Nguyen-7 | $\log(x+1)+\log(x^2+1)$ | 1 | U(0,2,20)
Nguyen-8 | $\sqrt{x}$ | 1 | U(0,4,20)
Nguyen-9 | $\sin(x)+\sin(y^2)$ | 2 | U(-1,1,100)
Nguyen-10 | $2\sin(x)\cos(y)$ | 2 | U(-1,1,100)
Nguyen-11 | $x^y$ | 2 |
Nguyen-12 | $x^4-x^3+\frac{1}{2}y^2-y$ | 2 |
Jin-1 | $2.5x^4-1.3x^3+0.5y^2-1.7y$ | 2 | U(-3,3,100)
Jin-2 | $8.0x^2+8.0y^3-15.0$ | 2 | U(-3,3,100)
Jin-3 | $0.2x^3+1.5y^3-1.2y-0.5x$ | 2 | U(-3,3,100)
Jin-4 | $1.5\exp(x)+5.0\cos(y)$ | 2 | U(-3,3,100)
Jin-5 | $6.0\sin(x)\cos(y)$ | 2 | U(-3,3,100)
Jin-6 | $1.35xy+5.5\sin((x-1.0)(y-1.0))$ | 2 | U(-3,3,100)
Keijzer-1 | $0.3x\sin(2\pi x)$ | 1 | E[-1,1,0.1]
Keijzer-2 | $0.3x\sin(2\pi x)$ | 1 | E[-2,2,0.1]
Keijzer-3 | $0.3x\sin(2\pi x)$ | 1 | E[-3,3,0.1]
Keijzer-4 | $x^3e^{-x}\cos(x)\sin(x)(\sin^2(x)\cos(x)-1)$ | 1 | E[0,10,0.05]
Keijzer-5 | $\frac{30xz}{(x-10)y^2}$ | 3 | x,z: U[-1,1,1000]; y: U[1,2,1000]
Keijzer-6 | $\sum_{i=1}^{x}\frac{1}{i}$ | 1 | E[1,50,1]
Keijzer-7 | $\log x$ | 1 | E[1,100,1]
Keijzer-8 | $\sqrt{x}$ | 1 | E[0,100,1]
Keijzer-9 | $\mathrm{arcsinh}(x)=\log(x+\sqrt{x^2+1})$ | 1 | E[0,100,1]
Keijzer-10 | $x^y$ | 2 | U[0,1,100]
Keijzer-11 | $xy+\sin((x-1)(y-1))$ | 2 | U[-3,3,20]
Keijzer-12 | $x^4-x^3+y^2/2-y$ | 2 | U[-3,3,20]
Keijzer-13 | $6\sin(x)\cos(y)$ | 2 | U[-3,3,20]
Keijzer-14 | $8/(2+x^2+y^2)$ | 2 | U[-3,3,20]
Keijzer-15 | $x^3/5+y^3/2-y-x$ | 2 | U[-3,3,20]
R1 | $(x+1)^3/(x^2-x+1)$ | 1 | E[-1,1,20]
R2 | $(x^5-3x^3+1)/(x^2+1)$ | 1 | E[-1,1,20]
R3 | $(x^6+x^5)/(x^4+x^3+x^2+x+1)$ | 1 | E[-1,1,20]
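Throughout Tables 7-9, U[a,b,n] (equivalently U(a,b,n)) denotes n points sampled uniformly at random from the interval [a,b], while E[a,b,s] denotes an evenly spaced grid from a to b with step s. The short Python sketch below illustrates this reading of the notation; the two helper functions are ours, introduced only for illustration:

import numpy as np

rng = np.random.default_rng(seed=0)

def sample_uniform(a, b, n):
    # U[a, b, n]: n points drawn uniformly at random from [a, b]
    return rng.uniform(a, b, size=n)

def sample_grid(a, b, step):
    # E[a, b, step]: evenly spaced points from a to b (inclusive) with the given step
    return np.arange(a, b + step / 2, step)

# Nguyen-5: sin(x^2) cos(x) - 1 on U(-1, 1, 20)
x = sample_uniform(-1.0, 1.0, 20)
y = np.sin(x**2) * np.cos(x) - 1.0

# Keijzer-4: x^3 e^(-x) cos(x) sin(x) (sin^2(x) cos(x) - 1) on E[0, 10, 0.05]
t = sample_grid(0.0, 10.0, 0.05)
g = t**3 * np.exp(-t) * np.cos(t) * np.sin(t) * (np.sin(t)**2 * np.cos(t) - 1.0)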


Table 8: Ground-truth expressions for Korns [16] and Livermore [19] benchmarks.

Dataset | Expression | Variables | Data range
Korns-1 | $1.57+24.3v$ | 1 | U[-50,50,10000]
Korns-2 | $0.23+14.2\frac{v+y}{3\omega}$ | 3 | U[-50,50,10000]
Korns-3 | $-5.41+4.9\frac{v-x+y/\omega}{3\omega}$ | 4 | U[-50,50,10000]
Korns-4 | $-2.3+0.13\sin(z)$ | 1 | U[-50,50,10000]
Korns-5 | $3+2.13\ln(\omega)$ | 1 | U[-50,50,10000]
Korns-6 | $1.3+0.13\sqrt{x}$ | 1 | U[-50,50,10000]
Korns-7 | $213.80940889(1-e^{-0.54723748542x})$ | 1 | U[-50,50,10000]
Korns-8 | $6.87+11\sqrt{7.23xv\omega}$ | 3 | U[-50,50,10000]
Korns-9 | $\frac{\sqrt{x}}{\ln(y)}\frac{e^z}{v^2}$ | 4 | U[-50,50,10000]
Korns-10 | $0.81+24.3\frac{2y+3z^2}{4v^3+5\omega^4}$ | 4 | U[-50,50,10000]
Korns-11 | $6.87+11\cos(7.23x^3)$ | 1 | U[-50,50,10000]
Korns-12 | $2-2.1\cos(9.8x)\sin(1.3\omega)$ | 2 | U[-50,50,10000]
Korns-13 | $32-3\frac{\tan(x)}{\tan(y)}\frac{\tan(z)}{\tan(v)}$ | 4 | U[-50,50,10000]
Korns-14 | $22-4.2(\cos(x)-\tan(y))\frac{\tanh(z)}{\sin(v)}$ | 4 | U[-50,50,10000]
Korns-15 | $12-6\frac{\tan(x)}{e^y}(\ln(z)-\tan(v))$ | 4 | U[-50,50,10000]
Livermore-1 | $1/3+x+\sin(x^2)$ | 1 | U[-10,10,1000]
Livermore-2 | $\sin(x^2)\cos(x)-2$ | 1 | U[-1,1,20]
Livermore-3 | $\sin(x^3)\cos(x^2)-1$ | 1 | U[-1,1,20]
Livermore-4 | $\log(x+1)+\log(x^2+1)+\log(x)$ | 1 | U[0,2,20]
Livermore-5 | $x^4-x^3+x^2-y$ | 2 | U[0,1,20]
Livermore-6 | $4x^4+3x^3+2x^2+x$ | 1 | U[-1,1,20]
Livermore-7 | $\sinh(x)$ | 1 | U[-1,1,20]
Livermore-8 | $\cosh(x)$ | 1 | U[-1,1,20]
Livermore-9 | $x^9+x^8+x^7+x^6+x^5+x^4+x^3+x^2+x$ | 1 | U[-1,1,20]
Livermore-10 | $6\sin(x)\cos(y)$ | 2 | U[0,1,20]
Livermore-11 | $x^2y^2/(x+y)$ | 2 | U[-1,1,50]
Livermore-12 | $x^5/y^3$ | 2 | U[-1,1,50]
Livermore-13 | $x^{1/3}$ | 1 | U[0,4,20]
Livermore-14 | $x^3+x^2+x+\sin(x)+\sin(x^2)$ | 1 | U[-1,1,20]
Livermore-15 | $x^{1/5}$ | 1 | U[0,4,20]
Livermore-16 | $x^{2/5}$ | 1 | U[0,4,20]
Livermore-17 | $4\sin(x)\cos(y)$ | 2 | U[0,1,20]
Livermore-18 | $\sin(x^2)\cos(x)-5$ | 1 | U[-1,1,20]
Livermore-19 | $x^5+x^4+x^2+x$ | 1 | U[-1,1,20]
Livermore-20 | $\exp(-x^2)$ | 1 | U[-1,1,20]
Livermore-21 | $x^8+x^7+x^6+x^5+x^4+x^3+x^2+x$ | 1 | U[-1,1,20]
Livermore-22 | $\exp(-0.5x^2)$ | 1 | U[-1,1,20]
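Success on these benchmarks is typically judged by exact recovery of the ground-truth expression rather than by numerical error alone. A common heuristic is to test whether the difference between the candidate and the ground truth simplifies to zero; the SymPy sketch below illustrates this check (it can produce false negatives when equivalent forms resist simplification):

import sympy as sp

x = sp.Symbol('x')
truth = sp.sin(x**2) * sp.cos(x) - 2        # Livermore-2 ground truth
candidate = sp.cos(x) * sp.sin(x * x) - 2   # hypothetical model output

# The expressions are judged equivalent if their difference simplifies to 0.
print(sp.simplify(truth - candidate) == 0)  # True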

Table 9: Ground-truth expressions for Vladislavleva [15] benchmark.

Dataset | Expression | Variables | Data range
Vladislavleva-1 | $\frac{e^{-(x-1)^2}}{1.2+(y-2.5)^2}$ | 2 | U[0.3,4,100]
Vladislavleva-2 | $e^{-x}x^3(\cos x\sin x)(\cos x\sin^2 x-1)$ | 1 | E[0.05,10,0.1]
Vladislavleva-3 | $e^{-x}x^3(\cos x\sin x)(\cos x\sin^2 x-1)(y-5)$ | 2 | x: E[0.05,10,0.1]; y: E[0.05,10.05,2]
Vladislavleva-4 | $\frac{10}{5+\sum_{i=1}^{5}(x_i-3)^2}$ | 5 | U[0.05,6.05,1024]
Vladislavleva-5 | $30\frac{(x-1)(z-1)}{y^2(x-10)}$ | 3 | x: U[0.05,2,300]; y: U[1,2,300]; z: U[0.05,2,300]
Vladislavleva-6 | $6\sin(x)\cos(y)$ | 2 | U[0.1,5.9,30]
Vladislavleva-7 | $(x-3)(y-3)+2\sin((x-4)(y-4))$ | 2 | U[0.05,6.05,300]
Vladislavleva-8 | $\frac{(x-3)^4+(y-3)^3-(y-3)}{(y-2)^4+10}$ | 2 | U[0.05,6.05,50]
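Several Vladislavleva problems (e.g., Vladislavleva-3 and Vladislavleva-5) assign a different range to each input variable, so each input must be sampled independently before the target is evaluated. A minimal sketch for Vladislavleva-5, assuming NumPy:

import numpy as np

rng = np.random.default_rng(seed=0)

# Vladislavleva-5: 30 (x-1)(z-1) / (y^2 (x-10)), 300 samples per variable
x = rng.uniform(0.05, 2.0, 300)
y = rng.uniform(1.0, 2.0, 300)
z = rng.uniform(0.05, 2.0, 300)
f = 30.0 * (x - 1.0) * (z - 1.0) / (y**2 * (x - 10.0))

Note that the stated ranges keep the target well defined: y >= 1 bounds the denominator away from zero, and x <= 2 keeps the factor (x - 10) strictly negative.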





Table 10: Feynman physics equations [27].

Function form | #variables
$f=\exp(-\theta^2/2)/\sqrt{2\pi}$ | 1
$f=\exp(-(\theta/\sigma)^2/2)/(\sqrt{2\pi}\sigma)$ | 2
$f=\exp(-((\theta-\theta_1)/\sigma)^2/2)/(\sqrt{2\pi}\sigma)$ | 3
$d=\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}$ | 4
$F=\frac{Gm_1m_2}{(x_2-x_1)^2+(y_2-y_1)^2+(z_2-z_1)^2}$ | 9
$m=\frac{m_0}{\sqrt{1-v^2/c^2}}$ | 3
$A=x_1y_1+x_2y_2+x_3y_3$ | 6
$F=\mu N_n$ | 2
$F=\frac{q_1q_2}{4\pi\epsilon r^2}$ | 4
$E_f=\frac{q_1r}{4\pi\epsilon r^3}$ | 3
$F=q_2E_f$ | 2
$F=q(E_f+Bv\sin(\theta))$ | 5
$K=\frac{1}{2}m(v^2+u^2+w^2)$ | 4
$U=Gm_1m_2\left(\frac{1}{r_2}-\frac{1}{r_1}\right)$ | 5
$U=mgz$ | 3
$U=\frac{1}{2}k_{\mathrm{spring}}x^2$ | 2
$x'=\frac{x-ut}{\sqrt{1-u^2/c^2}}$ | 4
$t'=\frac{t-ux/c^2}{\sqrt{1-u^2/c^2}}$ | 4
$p=\frac{m_0v}{\sqrt{1-v^2/c^2}}$ | 3
$v_1=\frac{u+v}{1+uv/c^2}$ | 3
$r=\frac{m_1r_1+m_2r_2}{m_1+m_2}$ | 4
$\tau=rF\sin(\theta)$ | 3
$L=mrv\sin(\theta)$ | 4
$E=\frac{1}{4}m(\omega^2+\omega_0^2)x^2$ | 4
$V_e=q/C$ | 2
$\theta_1=\arcsin(n\sin(\theta_2))$ | 2
$f_f=\frac{1}{\frac{1}{d_1}+\frac{n}{d_2}}$ | 3
$k=\omega/c$ | 2
$x=\sqrt{x_1^2+x_2^2-2x_1x_2\cos(\theta_1-\theta_2)}$ | 4
$I=I_0\frac{\sin^2(n\theta/2)}{\sin^2(\theta/2)}$ | 3
$\theta=\arcsin(\lambda/(nd))$ | 3
$P=\frac{q^2a^2}{6\pi\epsilon c^3}$ | 4
$P=\left(\frac{1}{2}\epsilon cE_f^2\right)\left(\frac{8\pi r^2}{3}\right)\frac{\omega^4}{(\omega^2-\omega_0^2)^2}$ | 6
$\omega=\frac{qvB}{p}$ | 4
$\omega=\frac{\omega_0}{1-v/c}$ | 3
$\omega=\frac{1+v/c}{\sqrt{1-v^2/c^2}}\,\omega_0$ | 3
$E=\hbar\omega$ | 2
$I=I_1+I_2+2\sqrt{I_1I_2}\cos(\delta)$ | 3
$r=\frac{4\pi\epsilon\hbar^2}{mq^2}$ | 4
$E=\frac{3}{2}p_FV$ | 2
$E=\frac{1}{\gamma-1}p_FV$ | 3
$P_F=\frac{nk_bT}{V}$ | 4
$n=n_0\exp(-mgx/(k_bT))$ | 6
$L_{\mathrm{rad}}=\frac{\hbar\omega^3}{\pi^2c^2(\exp(\hbar\omega/(k_bT))-1)}$ | 5
$v=\frac{\mu_{\mathrm{drift}}qV_e}{d}$ | 4
$D=\mu_ek_bT$ | 3
$\kappa=\frac{1}{\gamma-1}\frac{k_bv}{A}$ | 4
$E=nk_bT\ln(V_2/V_1)$ | 5
$c=\sqrt{\gamma p_r/\rho}$ | 3
$E=\frac{mc^2}{\sqrt{1-v^2/c^2}}$ | 3
$x=x_1(\cos(\omega t)+\alpha\cos^2(\omega t))$ | 4
$P=\frac{\kappa(T_2-T_1)A}{d}$ | 5
$F_E=\frac{P_{wr}}{4\pi r^2}$ | 2
$V_e=\frac{q}{4\pi\epsilon r}$ | 3
$V_e=\frac{1}{4\pi\epsilon}\frac{p_d\cos(\theta)}{r^2}$ | 4
$E_f=\frac{3}{4\pi\epsilon}\frac{p_dz}{r^5}\sqrt{x^2+y^2}$ | 6
$E_f=\frac{3}{4\pi\epsilon}\frac{p_d\cos(\theta)\sin(\theta)}{r^3}$ | 4
$E=\frac{3}{5}\frac{q^2}{4\pi\epsilon d}$ | 3
$E_{\mathrm{den}}=\frac{\epsilon E_f^2}{2}$ | 2
$E_f=\frac{\sigma_{\mathrm{den}}}{\epsilon}\frac{1}{1+\chi}$ | 3
$x=\frac{qE_f}{m(\omega_0^2-\omega^2)}$ | 5
$n=n_0(1+p_dE_f\cos(\theta)/(k_bT))$ | 6
$P=\frac{n_\rho p_d^2E_f}{3k_bT}$ | 5
$P=\frac{n\alpha}{1-n\alpha/3}\epsilon E_f$ | 4
$\theta=1+\frac{n\alpha}{1-n\alpha/3}$ | 2
$B=\frac{1}{4\pi\epsilon c^2}\frac{2I}{r}$ | 4
$\rho_c=\frac{\rho_{c_0}}{\sqrt{1-v^2/c^2}}$ | 3
$j=\frac{\rho_{c_0}v}{\sqrt{1-v^2/c^2}}$ | 3
$E=-\mu_MB\cos(\theta)$ | 3
$E=-p_dE_f\cos(\theta)$ | 3
$V_e=\frac{q}{4\pi\epsilon r(1-v/c)}$ | 5
$k=\sqrt{\frac{\omega^2}{c^2}-\frac{\pi^2}{d^2}}$ | 3
$F_E=\epsilon cE_f^2$ | 3
$E_{\mathrm{den}}=\epsilon E_f^2$ | 2
$I=\frac{qv}{2\pi r}$ | 3
$\mu_M=\frac{qvr}{2}$ | 3
$\omega=\frac{gqB}{2m}$ | 4
$\mu_M=\frac{q\hbar}{2m}$ | 3
$E=\frac{g\mu_MBJ_z}{\hbar}$ | 5
$M=n_\rho\mu_M\tanh\left(\frac{\mu_MB}{k_bT}\right)$ | 5
$f=\frac{\mu_mB}{k_bT}+\frac{\mu_m\alpha}{\epsilon c^2k_bT}M$ | 8
$E=\mu_M(1+\chi)B$ | 6
$F=\frac{YAx}{d}$ | 4
$\mu_S=\frac{Y}{2(1+\sigma)}$ | 2
$E=\frac{\hbar\omega}{\exp(\hbar\omega/(k_bT))-1}$ | 4





Table 11: Feynman physics equations [27].

Function form | #variables
$n=\frac{1}{\exp(\hbar\omega/(k_bT))-1}$ | 4
$n=\frac{n_0}{\exp(\mu_mB/(k_bT))+\exp(-\mu_mB/(k_bT))}$ | 5
$\omega=\frac{2\mu_MB}{\hbar}$ | 3
$p_\gamma=\sin^2(E_nt/\hbar)$ | 3
$p_\gamma=\frac{(p_dE_ft/\hbar)\sin^2((\omega-\omega_0)t/2)}{((\omega-\omega_0)t/2)^2}$ | 6
$E=\mu_M\sqrt{B_x^2+B_y^2+B_z^2}$ | 4
$L=n\hbar$ | 2
$v=\frac{2E_nd^2k}{\hbar}$ | 4
$I=I_0(\exp(qV_e/(k_bT))-1)$ | 5
$E=2U(1-\cos(kd))$ | 3
$m=\frac{\hbar^2}{2E_nd^2}$ | 3
$k=\frac{2\pi\alpha}{nd}$ | 3
$f=\beta(1+\alpha\cos(\theta))$ | 3
$E=-\frac{mq^4}{2(4\pi\epsilon)^2\hbar^2}\frac{1}{n^2}$ | 4
$j=-\frac{\rho_{c_0}qA_{\mathrm{vec}}}{m}$ | 4
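In Tables 10 and 11, the #variables column counts every symbol the regressor must treat as an input, including physical constants such as $\epsilon$ and $k_b$, whereas $\pi$ is treated as a known constant. This convention can be checked mechanically; the SymPy sketch below reproduces the count for the Coulomb-law entry of Table 10:

import sympy as sp

q1, q2, eps, r = sp.symbols('q_1 q_2 epsilon r', positive=True)

# F = q1*q2 / (4*pi*epsilon*r^2); sympy.pi is a known constant, so the
# free symbols are q1, q2, epsilon, and r.
F = q1 * q2 / (4 * sp.pi * eps * r**2)
print(len(F.free_symbols))  # 4, matching the table's #variables entry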