
Contemporary Symbolic Regression Methods and their Relative Performance


William La Cava*

Institute for Biomedical Informatics

University of Pennsylvania

lacava@upenn.edu

Patryk Orzechowski

Institute for Biomedical Informatics

University of Pennsylvania

patryk.orzechowski@gmail.com

Bogdan Burlacu

Josef Ressel Center for Symbolic Regression

University of Applied Sciences Upper Austria

bogdan.burlacu@fh-ooe.at

Fabrício Olivetti de França

Federal University of ABC

Santo Andre, Brazil

folivetti@ufabc.edu.br

Marco Virgolin

Mechanics and Maritime Sciences

Chalmers University of Technology

marco.virgolin@chalmers.se

Ying Jin

Department of Statistics

Stanford University

ying531@stanford.edu

Michael Kommenda

Josef Ressel Center for Symbolic Regression

University of Applied Sciences Upper Austria

michael.kommenda@fh-ooe.at

Jason H. Moore

Institute for Biomedical Informatics

University of Pennsylvania

jhmoore@upenn.edu

Abstract


Many promising approaches to symbolic regression have been presented in recent years, yet progress in the field continues to suffer from a lack of uniform, robust, and transparent benchmarking standards. In this paper, we address this shortcoming by introducing an open-source, reproducible benchmarking platform for symbolic regression. We assess 14 symbolic regression methods and 7 machine learning methods on a set of 252 diverse regression problems. Our assessment includes both real-world datasets with no known model form as well as ground-truth benchmark problems, including physics equations and systems of ordinary differential equations. For the real-world datasets, we benchmark the ability of each method to learn models with low error and low complexity relative to state-of-the-art machine learning methods. For the synthetic problems, we assess each method's ability to find exact solutions in the presence of varying levels of noise. Under these controlled experiments, we conclude that the best performing methods for real-world regression combine genetic algorithms with parameter estimation and/or semantic search drivers. When tasked with recovering exact equations in the presence of noise, we find that deep learning and genetic algorithm-based approaches perform similarly. We provide a detailed guide to reproducing this experiment and contributing new methods, and encourage other researchers to collaborate with us on a common and living symbolic regression benchmark.




*corresponding author

Department of Automatics and Robotics, AGH University of Science and Technology, Krakow, Poland

Center for Mathematics, Computation and Cognition | Heuristics, Analysis and Learning Laboratory





1 Introduction


Symbolic regression (SR) is an approach to machine learning (ML) in which both the parameters and structure of an analytical model are optimized. SR can be useful when one wishes to describe a process via a mathematical expression, especially a simple expression; thus, it is often applied in the hopes of producing a model of a process that, by virtue of its simplicity, may be easy to interpret. Interpretable ML is becoming increasingly important as model deployments in high stakes societal applications such as finance and medicine grow [1, 2]. Moreover, the mathematical expressions produced by SR are well-suited to be analyzed and controlled for their out-of-distribution behavior (e.g., in terms of asymptotic behavior, periodicity, etc.). These attractive properties of SR have led to its application in a number of areas, such as physics [3], biology [4], clinical informatics [5], climate modeling [6], finance [7], and many fields of engineering [8-10].

SR literature has, in general, fallen short of evaluating and ranking new methods in a way that facilitates their widespread adoption. Our view is that this shortcoming largely stems from a lack of standardized, transparent and reproducible benchmarks, especially those that test a large and diverse array of problems [11]. Although community surveys [11, 12] have led to suggestions for improving benchmarking standards, and even black-listed certain problems, contemporary literature continues to be published that violates those standards. Absent these standards, it is difficult to assess which methods or family of methods should be considered "state-of-the-art" (SotA).

Achieving a fleeting sense of SotA is certainly not the singular pursuit of methods research, yet without common, robust benchmarking studies, promising avenues of investigation cannot be well-informed by empirical evidence. We hope the benchmarking platform introduced in this paper improves the cross-pollination between research communities interested in SR, which include evolutionary computation, physics, engineering, statistics, and more traditional machine learning disciplines.

In this paper, we describe a large benchmarking effort that includes a dataset repository curated for SR, as well as a benchmarking library designed to allow researchers to easily contribute methods. To achieve this, we incorporated 130 datasets with ground truth forms into the Penn Machine Learning Benchmark (PMLB) [13], including metadata describing the underlying equations, their units, and various summary statistics. Furthermore, we created an SR benchmark repository called SRBench⁴ and sought contributions from researchers in this area. Here we describe this process and the results, which consist of comparisons of 14 contemporary SR methods on hundreds of regression problems.

To our knowledge, this is by far the largest and most comprehensive SR benchmark effort to date, which allows us to make claims concerning current SotA methods for SR with better certainty. Importantly, and in contrast to many previous efforts, the datasets, methods, benchmarking code, and results are completely open-source, reproducible, and revision-controlled, which should allow SRBench to exist as a living benchmark for future studies.

2 Background and Motivation


The goal of SR is to learn a mapping $\hat{y}(\mathbf{x}) = \hat{\phi}(\mathbf{x}, \hat{\theta}): \mathbb{R}^d \rightarrow \mathbb{R}$ using a dataset of paired examples $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, with features $\mathbf{x} \in \mathbb{R}^d$ and target $y$. SR assumes the existence of an analytical model of the form $y(\mathbf{x}) = \phi(\mathbf{x}, \theta) + \epsilon$ that would generate the observations in $\mathcal{D}$, and seeks to estimate this model by searching the space of expressions, $\phi$, and parameters, $\theta$, in the presence of white noise, $\epsilon$.
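To make this setup concrete, the following sketch generates a dataset from a hypothetical ground-truth model; the expression, parameters, and noise level are illustrative only, not one of the benchmark problems:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 2
X = rng.uniform(-1, 1, size=(N, d))          # features x in R^d
theta = (3.0, 0.5)                           # hidden parameters
phi = theta[0] * np.sin(X[:, 0]) + theta[1] * X[:, 1] ** 2  # phi(x, theta)
y = phi + rng.normal(scale=0.1, size=N)      # y = phi(x, theta) + epsilon
# An SR method receives only (X, y) and must search over expression
# structures phi_hat and parameter values theta_hat to recover the mapping.
```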




4 https://github.com/EpistasisLab/srbench





Koza [14] introduced SR as an application of genetic programming (GP), a field that investigates the use of genetic algorithms (GAs) to evolve executable data structures, i.e. programs. In the case of so-called "Koza-style" GP, the programs to be optimized are syntax trees consisting of functions/operations over input features and constants. Like in other GAs, GP is a process that evolves a population of candidate solutions (e.g., syntax trees) by iteratively producing offspring from parent solutions (e.g., by swapping parents' subtrees) and eliminating unfit solutions (e.g., programs with sub-par behavior). Most SR research to date has emerged from within this sub-field and its associated conferences.⁵

Despite the availability of post-hoc methods for explaining black-box model predictions [15], there have been recent calls to focus on learning interpretable/transparent models explicitly [2]. Perhaps due to this renewed interest in model interpretability, entirely different methods for tackling SR have been proposed [16-22]. These include methods based in Bayesian optimization [16], recurrent neural networks (RNNs) [17], and physics-inspired divide-and-conquer strategies [18, 23]. Some of these papers refer to Eureqa, a commercial, GP-based SR method used to re-discover known physics equations [3], as the "gold standard" for SR [17] and/or the best method for SR "by far" [18]. However, Schmidt and Lipson [24] make no claim to being the SotA method for SR, nor is this hypothesis tested in the body of work on which Eureqa is based [25].

Although commercial platforms like Eureqa and Wolfram [26] are successful tools for applying SR, they are not designed to support controlled benchmark experiments, and therefore experiments utilizing them have serious caveats. Due to the design of the front-end API for both tools, it is not possible to benchmark either method against others while holding important parameters of such an experiment constant, including the computational effort, number of model evaluations, CPU/memory limits, and final solution assessment. More generally, researchers cannot uniquely determine which features of the software and/or experiment lead to observed differences in performance, given that these commercial tools are closed-source. In this light, it is not clear what insights are to be gained when comparing to Eureqa and Wolfram beyond a simple head-to-head comparison. Therefore, rather than benchmark against Eureqa in this paper, we implement its underlying algorithms in an open-source package, which allows our experiment to remain transparent, reproducible, accessible, and controlled. We discuss the algorithms underlying Eureqa in detail in Sec. A.3.

A close reading of SR literature since 2009 implies that a number of proposed methods would outperform Eureqa in controlled tests, and are therefore suitable choices for benchmarking (e.g. [27, 28]). Unfortunately, the widespread adoption of these promising SR approaches is hamstrung by a lack of consensus on good benchmark problems, testing frameworks, and experimental designs. Our effort to establish a common benchmark is motivated by our view that common, robust, standardized benchmarks for SR could speed progress in the field by providing a clear baseline from which to assert the quality of new approaches. Consider the NN community's focus on common benchmarks (e.g. ImageNet [29]), frameworks (e.g. TensorFlow, PyTorch) and experiment designs. By contrast, it is common to observe results in SR literature that are based on a small number of low dimensional, easy and unrealistic problems, comparing only to very basic GP systems such as those described in [14] nearly thirty years ago. Despite detailed descriptions of these issues [11], community surveys and proposals to "black-list" toy problems [12], toy datasets and comparisons to out-dated SR methods continue to appear in contemporary literature.

The aspects of performance assessment for SR differ from typical regression benchmarking due to the interest in obtaining concise, symbolic expressions. In general, the trade-off between accuracy and simplicity must be considered when evaluating the merits of different models. Furthermore, model simplicity, typically measured as sparsity or model size, is but a proxy for model interpretability; a simple model may still be un-interpretable, or simply wrong [30-32]. With these concerns in mind, datasets with ground truth solutions are useful, in that they allow researchers to assess whether or not the symbolic model regressed by a given method corresponds to a known analytical solution. Nevertheless, benchmarks utilizing synthetic datasets with ground-truth solutions are not sufficient for assessing real-world performance, and so we consider it essential to also evaluate the performance of SR on real-world or otherwise black-box regression problems, relative to SotA ML methods.




5 A non-exhaustive list: GECCO, EuroGP, FOGA, PPSN, and IEEE CEC.





There have been a few recent efforts to benchmark SR algorithms [33], including a precursor to this work benchmarking four SR methods on 94 regression problems [34]. In both cases, SR methods were assessed solely on their ability to make accurate predictions. In contrast, Udrescu and Tegmark [18] proposed 120 new synthetic, physics-based datasets for SR, but compared only to Eureqa and only in terms of solution rates. A major contribution of our work is its significantly more comprehensive scope than previous studies. We include 14 SR methods on 252 datasets in comparison to 7 ML methods. Our metrics of comparison are also more comprehensive, and include 1) accuracy, 2) simplicity, and 3) exact or approximate symbolic matches to the ground truth process. Furthermore, we have made the benchmark openly available, reproducible, and open for contributions supported by continuous integration [35].

3 SRBench


We created SRBench to be a reproducible, open-source benchmarking project by pulling together a large set of diverse benchmark datasets, contemporary SR methods, and ML methods around a shared model evaluation and analysis environment. SRBench overcomes several of the issues in current benchmarking literature as described in Sec. 2. For example, it makes it easy for methodologists to benchmark new algorithms over hundreds of problems, in comparison to strong, contemporary reference methods. These improvements allow us to reason with more certainty than in previous work about the SotA methods for SR.

In order to establish common datasets, we extended PMLB, a repository of standardized regression and classification problems [13, 36], by adding 130 SR datasets with known model forms. PMLB provides utilities for fetching and handling data, recording and visualizing dataset metadata, and contributing new datasets. The SR methods we benchmarked are all contemporary implementations (2011-2020) from several method families, as shown in Tbl. 1. We required contributors to implement a minimal, Scikit-learn-compatible [37] Python API for their method. In addition, contributors were required to provide the final fitted model as a string that was compatible with the symbolic mathematics library sympy. Note that although we require a Python wrapper, SR implementations in many different languages are supported, as long as the Python API is available and the language environment can be managed via Anaconda⁶.
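As an illustration of this contract, a contributed method might look like the sketch below; the class, attribute, and function names are hypothetical stand-ins, not the exact SRBench interface:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class MySRRegressor(BaseEstimator, RegressorMixin):
    """A stub satisfying a minimal Scikit-learn-style fit/predict API."""

    def __init__(self, generations=100):
        self.generations = generations

    def fit(self, X, y):
        # ... run the symbolic search here; a result is hard-coded for brevity ...
        self.expr_ = "1.5*x_0 + sin(x_1)"   # final model as a sympy-parsable string
        return self

    def predict(self, X):
        return 1.5 * X[:, 0] + np.sin(X[:, 1])

def model(est):
    """Return the fitted model as a string compatible with sympy.sympify."""
    return est.expr_
```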

To ensure reproducibility, we defined a common environment (via Anaconda) with fixed versions of packages and their dependencies. In contrast to most SR studies, the full installation code, experiment code, results and analysis are available via the repository for use in future studies. To make SRBench as extensible as possible, we automated the process of incorporating new methods and results into the analysis pipeline. The repository accepts rolling contributions of new methods that meet the minimal API requirements. To achieve this, we created a continuous integration (CI) [35] framework that assures contributions are compatible with the benchmark code as they arrive. CI also supports continuous updates to results reporting and visualization whenever new experiments are available, allowing us to maintain a standing leader-board of contemporary SR methods. Ideally these features will quicken the adoption of SotA approaches throughout the SR research community. Further details on how to use and contribute to SRBench are provided in Sec. A.1.




6 https://www.anaconda.com/






Table 1: Short descriptions of the SR methods benchmarked in our experiment, including references and links to implementations.

Method | Year | Description | Method Family | Implementation
AFP [38] | 2011 | Age-fitness Pareto Optimization | GP | C++/Python (link)
AFP_FE [24] | 2011 | AFP with co-evolved fitness estimates; Eureqa-esque | GP | C++/Python (link)
AIFeynman [23] | 2020 | Physics-inspired method | Divide and conquer | Fortran/Python (link)
BSR [16] | 2020 | Bayesian Symbolic Regression | Markov Chain Monte Carlo | Python (link)
DSR [17] | 2020 | Deep Symbolic Regression | Recurrent neural networks | Python (PyTorch) (link)
EPLEX [39] | 2016 | ϵ-lexicase selection | GP | C++/Python (link)
FEAT [40] | 2019 | Feature Engineering Automation Tool | GP | C++/Python (link)
FFX [41] | 2011 | Fast function extraction | Random search | C++/Python (link)
GP-GOMEA [42] | 2020 | GP version of the Gene-pool Optimal Mixing Evolutionary Algorithm | GP | C++/Python (link)
gplearn | 2015 | Koza-style symbolic regression in Python | GP | Python (link)
ITEA [43] | 2020 | Interaction-Transformation EA | GP | Haskell/Python (link)
MRGP [44] | 2014 | Multiple Regression Genetic Programming | GP | Java (link)
Operon [45] | 2019 | SR with non-linear least squares | GP | C++/Python (link)
SBP-GP [46] | 2019 | Semantic Back-propagation Genetic Programming | GP | C++/Python (link)

Table 2: Settings used in the benchmark experiments. "Total comparisons" refers to the total evaluations of an algorithm on a dataset for a given noise level and random seed.

Setting | Black-box Problems | Ground-truth Problems
No. of datasets | 122 | 130
No. of algorithms | 21 (14 SR, 7 ML) | 14
No. of trials per dataset | 10 | 10
Train/test split | 0.75/0.25 | 0.75/0.25
Hyperparameter tuning | 5-fold Halving Grid Search CV | Tuned set from black-box problems
Termination criteria | 500k evaluations/train or 48 hours | 1M evaluations or 8 hours
Levels of target noise | None | 0, 0.001, 0.01, 0.1
Total comparisons | 26,840 | 54,600
Computing budget | 1.29M core hours | 436.8K core hours


4 Experiment Design


We evaluated SR methods on two separate tasks. First, we assessed their ability to make accurate predictions on "black-box" regression problems (in which the underlying data generating function remains unknown) while minimizing the complexity of the discovered models. Second, we tested the ability of each method to find exact solutions to synthetic datasets with known, ground-truth functions, originating from physics and various fields of engineering.

The basic experiment settings are summarized in Tbl. 2. Each algorithm was trained on each dataset (and level of noise, for ground-truth problems) in 10 repeated trials with a different random state that controlled both the train/test split and the seed of the algorithm. Datasets were split 75/25% into training and testing. For black-box regression problems, each algorithm was tuned using 5-fold cross validation with halving grid search. The SR algorithms were limited to 6 hyperparameter combinations; the ML methods were allowed more, as shown in Tbls. 4-6. The best hyperparameter settings were used to tune a final estimator and evaluate it according to the metrics described below. Details for running the experiment are given in Sec. A.1.
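For reference, the tuning protocol can be approximated with Scikit-learn as in the sketch below; the estimator and grid are stand-ins rather than the tuned settings of Tbls. 4-6:

```python
import numpy as np
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV, train_test_split
from sklearn.ensemble import RandomForestRegressor

X, y = np.random.rand(200, 5), np.random.rand(200)      # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42)             # 75/25 split per trial
param_grid = {"n_estimators": [100, 500], "max_features": ["sqrt", 1.0]}
search = HalvingGridSearchCV(RandomForestRegressor(random_state=42),
                             param_grid, cv=5)          # 5-fold halving search
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))  # test-set R^2
```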


4.1 Symbolic Regression Methods


Here we characterize the SR methods summarized in Tbl. 1 by describing how they fit into broader research trends within the SR field. The most traditional implementation of GP-based SR we test is gplearn, which initializes a random population of programs/models, and then iterates through the steps of tournament selection, mutation and crossover.
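A minimal gplearn run illustrating this loop is sketched below; the hyperparameter values are illustrative defaults, not the tuned settings from our experiment, and the data are reused from the tuning sketch above:

```python
from gplearn.genetic import SymbolicRegressor

est = SymbolicRegressor(population_size=1000,
                        generations=20,
                        tournament_size=20,        # tournament selection
                        p_crossover=0.9,           # subtree crossover
                        p_subtree_mutation=0.05,   # subtree mutation
                        function_set=('add', 'sub', 'mul', 'div', 'sin', 'cos'),
                        random_state=0)
est.fit(X_train, y_train)
print(est._program)                                # best evolved expression
```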

Pareto optimization methods [8, 47-49] are popular evolutionary strategies that exploit Pareto dominance relations to drive the population of models towards a set of efficient trade-offs between competing objectives. Half of the SR methods we test use Pareto optimization in some form during training. Age-Fitness Pareto optimization (AFP), proposed by Eureqa's authors Schmidt and Lipson [38], uses a model's age as an objective in order to reduce premature convergence as well as bloat [50]. AFP_FE combines AFP with Eureqa's method for fitness estimation [51]. Thus we expect AFP_FE and AFP to perform similarly to Eureqa as described in literature.
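The dominance relation driving these methods is simple to state; a minimal sketch for the two AFP objectives (error and age, both minimized) is:

```python
def dominates(a, b):
    """a, b: (error, age) pairs; a dominates b if it is no worse on both
    objectives and strictly better on at least one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])

def pareto_front(pop):
    """Return the non-dominated subset of a population of (error, age) pairs."""
    return [p for p in pop if not any(dominates(q, p) for q in pop)]

# A young model survives alongside older, more accurate ones:
print(pareto_front([(0.10, 5), (0.30, 1), (0.25, 3), (0.40, 4)]))
# -> [(0.1, 5), (0.3, 1), (0.25, 3)]
```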

Another promising line of research has been to leverage program semantics (in this case, the equation's intermediate and final outputs over training samples) more heavily during optimization, rather than compressing that information into aggregate fitness values [52]. ϵ-lexicase selection (EPLEX) [27] is a parent selection method that utilizes semantics to conduct selection by filtering models through randomized subsets of cases, which rewards models that perform well on difficult regions of the training data. EPLEX is also used as the parent selection method in FEAT [40]. Semantic backpropagation (SBP) is another semantic technique to compute, for a given target value and a tree node position, the value which makes the output of the model match the target (i.e., the label) [53-55]. Here, we evaluate the SBP-GP algorithm by Virgolin et al. [46], which improves SBP-based recombination by dynamically adapting intermediate outputs using affine transformations.
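A sketch of ϵ-lexicase selection follows, with ϵ set per training case via the median absolute deviation as described in [27]:

```python
import numpy as np

def epsilon_lexicase_select(errors, rng):
    """errors: (n_models, n_cases) array of per-case errors.
    Returns the index of one selected parent."""
    pool = np.arange(errors.shape[0])
    # per-case epsilon: median absolute deviation of that case's errors
    mad = np.median(np.abs(errors - np.median(errors, axis=0)), axis=0)
    for case in rng.permutation(errors.shape[1]):   # random case ordering
        best = errors[pool, case].min()
        pool = pool[errors[pool, case] <= best + mad[case]]  # keep near-elites
        if len(pool) == 1:
            break
    return rng.choice(pool)    # random tie-break among survivors

rng = np.random.default_rng(0)
parent = epsilon_lexicase_select(np.abs(rng.normal(size=(50, 100))), rng)
```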

Backpropagation-based gradient descent was proposed for GP-SR by Topchy and Punch [56], but tends to appear less often than stochastic hill climbing (e.g. [3, 57]). More recent studies [45, 58] have made a strong case for the use of gradient-based constant optimization as an improvement over stochastic and evolutionary approaches. The aforementioned studies are embodied by Operon, a GP method that incorporates non-linear least squares constant optimization using the Levenberg-Marquardt algorithm [59].
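The constant-optimization step can be illustrated with SciPy's Levenberg-Marquardt solver applied to a fixed model structure; this is a sketch in the spirit of Operon's approach (Operon itself implements the step in C++), again reusing the stand-in data from above:

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(theta, X, y):
    # fixed structure: theta0 * sin(theta1 * x0) + theta2 * x1
    return theta[0] * np.sin(theta[1] * X[:, 0]) + theta[2] * X[:, 1] - y

theta0 = np.ones(3)                             # initial constants
fit = least_squares(residuals, theta0, args=(X_train, y_train), method="lm")
print(fit.x)                                    # locally optimal constants
```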

In addition to the question of how to best optimize constants, a line of research has proposed different ways of defining program and/or model encodings. The methods FEAT, MRGP, ITEA, and FFX each impose additional structural assumptions on the models being evolved. In FEAT, each model is a linear combination of a set of evolved features, the parameters of which are encoded as edges and optimized via gradient descent. In MRGP [44], the entire program trace (i.e., each subfunction of the model) is decomposed into features and used to train a Lasso model. In ITEA, each model is an affine combination of interaction-transformation expressions, which compose a unary function (the transformation) and a polynomial function (the interaction) [43, 60]. Finally, FFX [41] simply initializes a population of equations, selects the Pareto optimal set, and returns a single linear model by treating the population of equations as features.
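As a structural sketch of this family, an FFX-like pipeline can be approximated by enumerating candidate basis functions and letting a sparse linear model select among them; FFX proper uses pathwise elastic-net fits and a Pareto filter, so this simplification conveys only the overall shape (stand-in data reused from above):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def basis_expand(X):
    """Enumerate simple candidate features for each input variable."""
    cols, names = [], []
    for j in range(X.shape[1]):
        cols += [X[:, j], X[:, j] ** 2, np.abs(X[:, j])]
        names += [f"x{j}", f"x{j}^2", f"|x{j}|"]
    return np.column_stack(cols), names

B, names = basis_expand(X_train)
lin = LassoCV(cv=5).fit(B, y_train)
print([n for n, c in zip(names, lin.coef_) if abs(c) > 1e-6])  # surviving terms
```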

GP-GOMEA is a GP algorithm where recombination is adapted over time [42, 61]. Every generation, GP-GOMEA builds a statistical model of interdependencies within the encoding of the evolving programs, and then uses this information to recombine interdependent blocks of components, as to preserve their concerted action.

Jin et al. [16] recently proposed Bayesian Symbolic Regression (BSR), in which a prior is placed on tree structures and the posterior distributions are sampled using a Markov Chain Monte Carlo (MCMC) method. As in GP-based SR, arithmetic expressions are expressed with symbolic trees, although BSR explicitly defines the final model form as a linear combination of several symbolic trees. Model parsimony is encouraged by specifying a prior that presumes additive, linear combinations of small components.


Deep Symbolic Regression (DSR) [17] uses reinforcement learning to train a generative RNN model of symbolic expressions. Expressions sampled from the model distribution are assessed to create a reward signal. DSR introduces a variant of the Monte Carlo policy gradient algorithm [62] dubbed a "risk-seeking policy gradient" in an effort to bias the generative model towards exact expressions.
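In sketch form, the risk-seeking update differs from vanilla REINFORCE only in which samples contribute; the numpy fragment below illustrates the idea from [17] and is not the DSR implementation:

```python
import numpy as np

def risk_seeking_gradient(rewards, grad_log_probs, alpha=0.95):
    """rewards: (B,); grad_log_probs: (B, n_params) per-sample score functions.
    Only the top (1 - alpha) fraction of the batch drives the update, with the
    alpha-quantile of reward as the baseline."""
    q = np.quantile(rewards, alpha)            # risk threshold
    keep = rewards > q                         # elite samples only
    if not keep.any():
        return np.zeros(grad_log_probs.shape[1])
    adv = rewards[keep] - q                    # advantage over the quantile
    return (adv[:, None] * grad_log_probs[keep]).mean(axis=0)
```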

AIFeynman is a divide-and-conquer approach that recursively applies a set of solvers and problem decomposition heuristics to build a symbolic model [18]. If the problem is not directly solve-able by polynomial fitting or brute-force search, AIFeynman trains a NN on the data and uses it to estimate functional modularities (e.g., symmetry and/or separability), which are used to partition the data into simpler problems and recurse. An updated version of the algorithm, which we test here, integrates Pareto optimization with an information-theoretic complexity metric to improve robustness to noise [23].

4.2 Datasets


All of the benchmark datasets are summarized by number of instances and number of features in Fig. 5. The problems range from 47 to 1 million instances, and two to 124 features. We used 122 black-box regression problems available in PMLB v.1.0. These problems are pulled from, and overlap with, various open-source repositories, including OpenML [63] and the UCI repository [64]. PMLB standardizes these datasets to a common format and provides fetching functions to load them into Python (and R). The black-box regression datasets consist of 46 "real-world" problems (i.e., observational data collected from physical processes) and 76 synthetic problems (i.e., data generated computationally from static functions or simulations). The black-box problems cover diverse domains, including health informatics (11), business (10), technology (10), environmental science (11) and government (12); in addition, they are derived from varied data sources, including human subjects (14), environmental observations (11), government studies (12), and economic markets (7). The datasets can be browsed by their properties at epistasislab.github.io/pmlb. Each dataset includes metadata describing source information as well as a detailed profile page summarizing the data distributions and interactions (here is an example).
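Each problem can be loaded in a line or two with the pmlb package; for example (the dataset name is one regression problem from the collection):

```python
from pmlb import fetch_data

X, y = fetch_data('529_pollen', return_X_y=True)   # numpy feature matrix and target
print(X.shape, y.shape)
```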

We extended PMLB with 130 datasets with known, ground-truth model forms. These datasets were used to assess the ability of SR methods to recover known process physics. The 130 datasets came from two sources: the Feynman Symbolic Regression Database, and the ODE-Strogatz repository. Both sets of data come from first principles models of physical systems. The Feynman problems originate in the Feynman Lectures on Physics [65], and the datasets were recently created and proposed as SR benchmarks [18]. Whereas the Feynman datasets represent static systems, the Strogatz problems are non-linear and chaotic dynamical processes [66]. Each dataset is one state of a 2-state system of first-order, ordinary differential equations (ODEs). They were used to benchmark SR methods in previous work [25,67] ,and are described in more detail in Sec. A.4 and Tbl. 3.
我们将 PMLB 扩展至包含 130 个具有已知真实模型形式的数据集。这些数据集用于评估符号回归方法恢复已知物理过程的能力。130 个数据集来自两个来源:费曼符号回归数据库和 ODE-Strogatz 存储库。两组数据均源自物理系统的第一性原理模型。费曼问题起源于《费曼物理学讲义》[65],相关数据集近期被创建并提议作为符号回归基准[18]。费曼数据集代表静态系统,而 Strogatz 问题涉及非线性和混沌动力学过程[66]。每个数据集均为二阶常微分方程(ODE)系统的一个状态,曾用于前期研究中符号回归方法的基准测试 [25,67] ,详见章节 A.4 和表 3。


Accuracy We assessed accuracy using the coefficient of determination, defined as

$$R^2 = 1 - \frac{\sum_i^N (y_i - \hat{y}_i)^2}{\sum_i^N (y_i - \bar{y})^2}.$$

Complexity A number of different complexity measures have been proposed for SR, including those based on syntactic complexity (i.e. related to the complexity of the symbolic model); those based on semantic complexity (i.e. related to the behavior of the model over the data) [23, 68]; those using both definitions [69]; and those estimating complexity via meta-learning [70]. The pros and cons of these methods and their relation to notions of interpretability is a point of discussion [71]. For the sake of simplicity, we opted to define complexity as the number of mathematical operators, features and constants in the model, where the mathematical operators are in the set {+, −, ×, /, sin, cos, arcsin, arccos, exp, log, pow, max, min}. In addition to calculating the complexity of the raw model forms returned by each method, we calculated the complexity of the models after simplifying via sympy.
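One reasonable implementation of this count, shown here only as a sketch, is the number of nodes in the sympy expression tree after simplification:

```python
import sympy

def complexity(model_str):
    """Count operators, features, and constants as nodes of the parsed tree."""
    expr = sympy.simplify(sympy.sympify(model_str))
    return sum(1 for _ in sympy.preorder_traversal(expr))

print(complexity("x0*sin(x1) + 2.5"))   # Add, Mul, x0, sin, x1, 2.5 -> 6
```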

Solution Criteria For the ground-truth regression problems, we used the following solution definition.

Definition 4.1 (Symbolic Solution). A model $\hat{\phi}(\mathbf{x}, \hat{\theta})$ is a Symbolic Solution to a problem with ground-truth model $y = \phi(\mathbf{x}, \theta) + \epsilon$ if $\hat{\phi}$ does not reduce to a constant, and if either of the following conditions are true: 1) $\phi - \hat{\phi} = a$; or 2) $\phi / \hat{\phi} = b$, $b \neq 0$, for some constants $a$ and $b$.

This definition is designed to capture models that differ from the true model by a constant or scalar. Prior to assessing symbolic solutions, each model underwent sympy simplification, as did the conditions above. Relative to accuracy metrics, the Symbolic Solution metric is a more faithful evaluation of the ability of an SR method to discover the data generating process. However, because models can be represented in myriad ways, and sympy's simplification procedure is non-optimal, we cannot guarantee that all symbolic solutions are captured with perfect fidelity by this metric.
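A sketch of this check using sympy follows; the benchmark's actual handling of edge cases is more involved:

```python
import sympy

def is_symbolic_solution(phi, phi_hat):
    """Definition 4.1: phi_hat must not be constant, and must differ from phi
    by only an additive or a nonzero multiplicative constant."""
    phi_hat = sympy.simplify(phi_hat)
    if phi_hat.is_constant():
        return False
    diff = sympy.simplify(phi - phi_hat)
    ratio = sympy.simplify(phi / phi_hat)
    return bool(diff.is_constant() or (ratio.is_constant() and ratio != 0))

x = sympy.Symbol('x')
print(is_symbolic_solution(2 * sympy.sin(x), sympy.sin(x)))   # True: ratio = 2
print(is_symbolic_solution(2 * sympy.sin(x), sympy.cos(x)))   # False
```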

5 Results


The median test set performance on all problems and methods for the black-box benchmark problems is summarized in Fig. 1. Across the problems, we find that the models generated by Operon are significantly more accurate than any other method's models in terms of test set $R^2$ ($p \le 6.5$e-05). SBP-GP and FEAT rank second and third and attain similar accuracies, although the models produced by FEAT are significantly smaller ($p = 9.2$e-22).

We note that four of the top five methods (Operon, SBP-GP, FEAT, EPLEX) and six of the top ten methods (GP-GOMEA, ITEA) are GP-based SR methods. The other top methods are ensemble tree-based methods, including two popular gradient-boosting algorithms, XGBoost and LightGBM [72, 73]; Random Forest [74]; and AdaBoost [75]. Among these methods, Operon, FEAT and SBP-GP significantly outperform LightGBM ($p \le 1.3$e-07), and Operon and SBP-GP outperform XGBoost ($p \le 1.3$e-04). We also note ITEA's overall accuracy is not significantly different from RandomForest or AdaBoost. Of note, the models produced by the top five SR methods (aside from SBP-GP) are 1-3 orders of magnitude smaller than models produced by the ensemble tree-based approaches ($p \le 1.3$e-21). Among the non-GP-based SR algorithms, FFX and DSR perform similarly to each other ($p = 0.76$) and significantly better than BSR and AIFeynman ($p \le 6.1$e-05). FFX trains more quickly than DSR, although DSR produces some of the smallest solutions, akin to penalized regression. We note that AIFeynman performs poorly on these problems, suggesting that not many of them exhibit the qualities of physical systems (rotational/translational invariance, symmetry, etc.) that AIFeynman was designed to exploit. Additional statistical comparisons are given in Figs. 9-11.





Figure 1: Results on the black-box regression problems. Points indicate the mean of the median test set performance on all problems, and bars show the 95% confidence interval. Methods marked with an asterisk are SR methods.





Figure 2: Pareto plot comparing the rankings of SR methods in terms of model size and $R^2$ score on the black-box problems. Points denote median rankings and the bars denote 95% confidence intervals. Connecting lines and color denote Pareto dominance rankings.

Figure 3: Solution rates for the ground-truth regression problems. Color/shape indicates level of noise added to the target variable.



In Fig. 2, we illustrate the performance of the methods on the black-box problems when accuracy and simplicity are considered simultaneously. The optimal Pareto front for these two objectives (solid line) is composed of three methods: Operon, GP-GOMEA, and DSR, which taken together give the set of best trade-offs between accuracy and simplicity across the black-box regression problems.

Performance on the ground-truth regression problems is summarized in Fig. 3, with methods sorted by their median solution rate and grouped by data source (Feynman or Strogatz). On average, when the target is free of noise, we observe that AIFeynman identifies exact solutions 53% of the time, nearly twice as often as the next closest method (GP-GOMEA, 27%). However, at noise levels above 0.01, four other methods recover exact solutions more often: DSR, gplearn, AFP_FE, and AFP. Taken together, the black-box and ground-truth regression results suggest AIFeynman may be brittle in application to real-world and/or noisy data, yet its performance with little to no noise is significant for the Feynman problems. On the Strogatz datasets, AIFeynman's performance is not significantly different than other methods, and indeed there are few significant differences in performance between the top 10 methods at any noise level. We note that the best method on real-world data, Operon, struggles to recover solutions to these problems, despite finding many candidate solutions with near-perfect test set scores. See Sec. A.6-A.7 for additional analysis.

6 Discussion and Conclusions


This paper introduces a SR benchmarking framework that allows objective comparisons of contemporary SR methods on a wide range of diverse regression problems. We have found that, on real-world and black-box regression tasks, contemporary GP-based SR methods (e.g. Operon) outperform new SR methods based in other fields of optimization, and can also perform as well as or better than gradient boosted trees while producing simpler models. On synthetic ground-truth physics and dynamical systems problems, we have verified that AIFeynman finds exact solutions significantly better than other methods when noise is minimal; otherwise, both deep learning-based methods (DSR) and GP-based SR methods (e.g. AFP_FE) perform best.

We see clear ways to improve SRBench by improving the dataset curation, experiment design and analysis. For one, we have not benchmarked the methods in a setting that allows them to exploit parallelism, which may change relative run-times. There are also many promising SR methods not included in this study that we hope to add in future revisions. In addition, whereas our benchmark includes real-world data as well as simulated data with ground-truth models, it does not include real-world data from phenomena with known, first principles models (e.g., observations of a mass-spring-damper system). Data such as these could help us better evaluate the ability of SR methods to discover relations under real-world conditions. We intend to include these data in future versions, given the evidence that SR models can sometimes discover unexpected analytical models that outperform the expert models in a field (e.g., in studies of yeast metabolism [76] and fluid tank systems [67]). As a final note, our current study highlights orthogonal approaches to SR that show promise, and in future work we hope to explore whether combinations of proposed methods (e.g., non-linear parameter optimization plus semantic search drivers) would have synergistic effects.


Acknowledgments


William La Cava was supported by NIH/NLM grant K99-LM012926. He would like to thank Curt Calafut and the Penn Medicine Academic Computing Services (PMACS), as well as the PLGrid Infrastructure, for supporting the computational experiments. He also thanks members of the Epistasis Lab for their patience, and Joseph D. Romano for coming through in a pinch.

Ying Jin would like to thank Doctor Jian Guo for hosting an internship for the project and Professor Jian Kang for helpful and inspiring guidance in Bayesian statistics.

The authors would also like to thank James McDermott for his generous contributions to the repository, and Randal Olson and Weixuan Fu for their initial push to integrate regression benchmarking into PMLB. Authors declare no competing interests.

References


[1] Anna Jobin, Marcello Ienca, and Effy Vayena. The global landscape of AI ethics guidelines. Nature Machine Intelligence, 1(9):389-399, September 2019. ISSN 2522-5839. doi: 10.1038/s42256-019-0088-2.

[2] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206-215, 2019.

[3] Michael Schmidt and Hod Lipson. Distilling free-form natural laws from experimental data. Science, 324 (5923):81-85, 2009.

[4] Michael Douglas Schmidt and Hod Lipson. Automated modeling of stochastic reactions with large measurement time-gaps. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, pages 307-314. ACM, 2011.

[5] William La Cava, Paul C. Lee, Imran Ajmal, Xiruo Ding, Priyanka Solanki, Jordana B. Cohen, Jason H. Moore, and Daniel S. Herman. Application of concise machine learning to construct accurate and interpretable EHR computable phenotypes. medRxiv, page 2020.12.12.20248005, February 2021. doi: 10.1101/2020.12.12.20248005.

[6] Karolina Stanislawska, Krzysztof Krawiec, and Zbigniew W. Kundzewicz. Modeling global temperature changes with genetic programming. Computers & Mathematics with Applications, 64(12):3717-3728, December 2012. ISSN 0898-1221. doi: 10.1016/j.camwa.2012.02.049.

[7] Shu-Heng Chen. Genetic Algorithms and Genetic Programming in Computational Finance. Springer Science & Business Media, 2012.

[8] Guido F. Smits and Mark Kotanchek. Pareto-front exploitation in symbolic regression. In Genetic Programming Theory and Practice II, pages 283-299. Springer, 2005.

[9] William La Cava, Kourosh Danai, Lee Spector, Paul Fleming, Alan Wright, and Matthew Lackner. Automatic identification of wind turbine models using evolutionary multiobjective optimization. Renewable Energy, 87, Part 2:892-902, March 2016. ISSN 0960-1481. doi: 10.1016/j.renene.2015.09.068.

[10] Mauro Castelli, Sara Silva, and Leonardo Vanneschi. A C++ framework for geometric semantic genetic programming. Genetic Programming and Evolvable Machines, 16(1):73-81, March 2015. ISSN 1389-2576, 1573-7632. doi: 10.1007/s10710-014-9218-0.

[11] James McDermott, David R. White, Sean Luke, Luca Manzoni, Mauro Castelli, Leonardo Vanneschi, Wojciech Jaskowski, Krzysztof Krawiec, Robin Harper, and Kenneth De Jong. Genetic programming needs better benchmarks. In Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation Conference, pages 791-798. ACM, 2012.


[12] David R. White, James McDermott, Mauro Castelli, Luca Manzoni, Brian W. Goldman, Gabriel Kronberger, Wojciech Jaśkowski, Una-May O'Reilly, and Sean Luke. Better GP benchmarks: Community survey results and proposals. Genetic Programming and Evolvable Machines, 14(1):3-29, December 2012. ISSN 1389-2576, 1573-7632. doi: 10.1007/s10710-012-9177-2.

[13] Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore. PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison. BioData Mining, 2017.

[14] John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA, 1992. ISBN 0-262-11170-5.

[15] Scott M Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems, 2017.

[16] Ying Jin, Weilin Fu, Jian Kang, Jiadong Guo, and Jian Guo. Bayesian Symbolic Regression. arXiv:1910.08892 [stat], January 2020.

[17] Brenden K. Petersen, Mikel Landajuela Larma, Terrell N. Mundhenk, Claudio Prata Santiago, Soo Kyung Kim, and Joanne Taery Kim. Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients. In International Conference on Learning Representations, September 2020.

[18] Silviu-Marian Udrescu and Max Tegmark. AI Feynman: A Physics-Inspired Method for Symbolic Regression. arXiv:1905.11481 [hep-th, physics:physics], April 2020.

[19] Maysum Panju. Automated Knowledge Discovery Using Neural Networks. 2021.

[20] Matthias Werner, Andrej Junginger, Philipp Hennig, and Georg Martius. Informed Equation Learning. arXiv preprint arXiv:2105.06331, 2021.

[21] Subham Sahoo, Christoph Lampert, and Georg Martius. Learning equations for extrapolation and control. In International Conference on Machine Learning, pages 4442-4450. PMLR, 2018.

[22] Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In International Conference on Machine Learning, pages 1945-1954. PMLR, 2017.

[23] Silviu-Marian Udrescu, Andrew Tan, Jiahai Feng, Orisvaldo Neto, Tailin Wu, and Max Tegmark. AI Feynman 2.0: Pareto-optimal symbolic regression exploiting graph modularity. arXiv:2006.10782 [physics, stat], December 2020.

[24] Michael Schmidt and Hod Lipson. Distilling free-form natural laws from experimental data. Science, 324 (5923):81-85, 2009.

[25] Michael Douglas Schmidt. Machine Science: Automated Modeling of Deterministic and Stochastic Dynamical Systems. PhD thesis, Cornell University, Ithaca, NY, USA, 2011.

[26] Giorgia Fortuna. Automatic Formula Discovery in the Wolfram Language - from Wolfram Library Archive. https://library.wolfram.com/infocenter/Conferences/9329/, 2015.

[27] William La Cava, Lee Spector, and Kourosh Danai. Epsilon-Lexicase Selection for Regression. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO '16, pages 741-748, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4206-3. doi: 10.1145/2908812.2908898.

[28] Paweł Liskowski and Krzysztof Krawiec. Discovery of Search Objectives in Continuous Domains. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO '17, pages 969-976, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4920-8. doi: 10.1145/3071178.3071344.

[29] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. Ieee, 2009.


[30] Zachary C Lipton. The Mythos of Model Interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31-57, 2018.

[31] Forough Poursabzi-Sangdeh, Daniel G Goldstein, Jake M Hofman, Jennifer Wortman Wortman Vaughan, and Hanna Wallach. Manipulating and measuring model interpretability. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1-52, 2021.

[32] Marco Virgolin, Andrea De Lorenzo, Francesca Randone, Eric Medvet, and Mattias Wahde. Model learning with personalized interpretability estimation (ML-PIE). arXiv:2104.06060 [cs], 2021.

[33] Jan Zegklitz and Petr Pošik. Benchmarking state-of-the-art symbolic regression algorithms. Genetic Programming and Evolvable Machines, pages 1-29, 2020.

[34] Patryk Orzechowski, William La Cava, and Jason H. Moore. Where are we now? A large benchmark study of recent symbolic regression methods. In Proceedings of the 2018 Genetic and Evolutionary Computation Conference, GECCO '18, April 2018. doi: 10.1145/3205455.3205539.
[34] Patryk Orzechowski, William La Cava 和 Jason H. Moore。现状如何?近期符号回归方法的大规模基准研究。载于《2018 年遗传与进化计算会议论文集》,GECCO '18,2018 年 4 月。doi: 10.1145/3205455.3205539。

[35] Martin Fowler. Continuous Integration. https://martinfowler.com/articles/continuousIntegration.html, 2006.
[35] Martin Fowler。持续集成。https://martinfowler.com/articles/continuousIntegration.html,2006 年。

[36] Joseph D. Romano, Trang T. Le, William La Cava, John T. Gregg, Daniel J. Goldberg, Natasha L. Ray, Praneel Chakraborty, Daniel Himmelstein, Weixuan Fu, and Jason H. Moore. PMLB v1.0: An open source dataset collection for benchmarking machine learning methods. arXiv:2012.00058 [cs], April 2021.
[36] Joseph D. Romano, Trang T. Le, William La Cava, John T. Gregg, Daniel J. Goldberg, Natasha L. Ray, Praneel Chakraborty, Daniel Himmelstein, Weixuan Fu, Jason H. Moore. PMLB v1.0:一个用于机器学习方法基准测试的开源数据集集合。arXiv:2012.00058 [cs],2021 年 4 月。

[37] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830, 2011.
[37] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg 等. Scikit-learn:Python 中的机器学习工具。Journal of Machine Learning Research,12(10 月):2825-2830, 2011。

[38] Michael Schmidt and Hod Lipson. Age-fitness pareto optimization. In Genetic Programming Theory and Practice VIII, pages 129-146. Springer, 2011.
[38] Michael Schmidt, Hod Lipson. 年龄-适应度帕累托优化。载于《遗传编程理论与实践 VIII》,第 129-146 页。Springer, 2011。

[39] William La Cava, Thomas Helmuth, Lee Spector, and Jason H. Moore. A probabilistic and multi-objective analysis of lexicase selection and epsilon-lexicase selection. Evolutionary Computation, 27(3):377-402, September 2019. ISSN 1063-6560. doi: 10.1162/evco_a_00224.
[39] William La Cava, Thomas Helmuth, Lee Spector, Jason H. Moore. 词典选择与ε-词典选择的概率性及多目标分析。Evolutionary Computation,27(3):377-402, 2019 年 9 月。ISSN 1063-6560。doi: 10.1162/evco_a_00224。

[40] William La Cava, Tilak Raj Singh, James Taggart, Srinivas Suri, and Jason H. Moore. Learning concise representations for regression by evolving networks of trees. In International Conference on Learning Representations, ICLR, 2019.
[40] William La Cava、Tilak Raj Singh、James Taggart、Srinivas Suri 和 Jason H. Moore。通过学习演化树网络来学习回归的简洁表示。发表于国际学习表示会议(ICLR),2019 年。

[41] Trent McConaghy. FFX: Fast, scalable, deterministic symbolic regression technology. In Genetic Programming Theory and Practice IX, pages 235-260. Springer, 2011.
[41] Trent McConaghy。FFX:快速、可扩展、确定性的符号回归技术。载于《遗传编程理论与实践 IX》,第 235-260 页。Springer,2011 年。

[42] Marco Virgolin, Tanja Alderliesten, Cees Witteveen, and Peter A N Bosman. Improving model-based genetic programming for symbolic regression of small expressions. Evolutionary Computation, page tba, 2020.
[42] Marco Virgolin、Tanja Alderliesten、Cees Witteveen 和 Peter A N Bosman。改进基于模型的遗传编程对小表达式符号回归的效果。《进化计算》,页码待定,2020 年。

[43] F. O. de Franca and G. S. I. Aldeia. Interaction-Transformation Evolutionary Algorithm for Symbolic Regression. Evolutionary Computation, pages 1-25, December 2020. ISSN 1063-6560. doi: 10.1162/ evco_a_00285.
[43] F. O. de Franca 和 G. S. I. Aldeia。符号回归的交互-转换进化算法。《进化计算》,第 1-25 页,2020 年 12 月。ISSN 1063-6560。doi: 10.1162/evco_a_00285。

[44] Ignacio Arnaldo, Krzysztof Krawiec, and Una-May O'Reilly. Multiple regression genetic programming. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, pages 879-886. ACM, 2014.
[44] 伊格纳西奥·阿纳尔多、克日什托夫·克拉维茨与乌娜-梅·奥莱利。多回归遗传编程。载于《2014 年遗传与进化计算年会论文集》,第 879-886 页。ACM 出版社,2014 年。

[45] Michael Kommenda, Bogdan Burlacu, Gabriel Kronberger, and Michael Affenzeller. Parameter identification for symbolic regression using nonlinear least squares. Genetic Programming and Evolvable Machines, December 2019. ISSN 1573-7632. doi: 10.1007/s10710-019-09371-3.
[45] 迈克尔·科门达、博格丹·布尔拉库、加布里埃尔·克龙伯格与迈克尔·阿芬采勒。基于非线性最小二乘法的符号回归参数辨识。《遗传编程与可进化机器》,2019 年 12 月。ISSN 1573-7632。doi: 10.1007/s10710-019-09371-3。


[46] Marco Virgolin, Tanja Alderliesten, and Peter AN Bosman. Linear scaling with and within semantic backpropagation-based genetic programming for symbolic regression. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 1084-1092, 2019.
[46] 马尔科·维尔戈林、坦雅·阿尔德利斯顿与彼得·AN·博斯曼。基于语义反向传播遗传编程的符号回归线性缩放方法。载于《遗传与进化计算会议论文集》,第 1084-1092 页,2019 年。

[47] Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, and T Meyarivan. A Fast Elitist Non-dominated Sorting Genetic Algorithm for Multi-objective Optimization: NSGA-II. In Marc Schoenauer, Kalyanmoy Deb, Günther Rudolph, Xin Yao, Evelyne Lutton, Juan Julian Merelo, and Hans-Paul Schwefel, editors, Parallel Problem Solving from Nature PPSN VI, volume 1917, pages 849-858. Springer Berlin Heidelberg, Berlin, Heidelberg, 2000. ISBN 978-3-540-41056-0.
[47] 卡扬莫伊·德布、萨米尔·阿格拉瓦尔、阿姆里特·普拉塔普与 T·梅亚里万。一种快速精英非支配排序遗传算法用于多目标优化:NSGA-II。编者:马克·舍恩 auer 等,《自然并行问题求解第六卷》,第 1917 卷,第 849-858 页。施普林格柏林海德堡出版社,2000 年。ISBN 978-3-540-41056-0。

[48] Eckart Zitzler, Marco Laumanns, and Lothar Thiele. SPEA2: Improving the Strength Pareto Evolutionary Algorithm. Eidgenössische Technische Hochschule Zürich (ETH), Institut für Technische Informatik und Kommunikationsnetze (TIK), 2001.
[48] 埃卡特·齐茨勒(Eckart Zitzler)、马可·劳曼斯(Marco Laumanns)与洛塔尔·蒂勒(Lothar Thiele)。SPEA2:改进强度帕累托进化算法。苏黎世联邦理工学院(ETH),技术信息与通信网络研究所(TIK),2001 年。

[49] S. Bleuler, M. Brack, L. Thiele, and E. Zitzler. Multiobjective genetic programming: Reducing bloat using SPEA2. In Proceedings of the 2001 Congress on Evolutionary Computation, 2001, volume 1, pages 536-543 vol. 1, 2001. doi: 10.1109/CEC.2001.934438.
[49] S. 布洛伊勒(S. Bleuler)、M. 布拉克(M. Brack)、L. 蒂勒(L. Thiele)与 E. 齐茨勒(E. Zitzler)。多目标遗传规划:利用 SPEA2 减少膨胀。收录于《2001 年进化计算大会论文集》,2001 年,第 1 卷,第 536-543 页,2001 年。doi: 10.1109/CEC.2001.934438。

[50] Gregory S. Hornby. ALPS: The age-layered population structure for reducing the problem of premature convergence. In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, GECCO '06, pages 815-822, New York, NY, USA, 2006. ACM. ISBN 1-59593-186-4. doi: 10.1145/ 1143997.1144142.
[50] 格雷戈里·S·霍恩比(Gregory S. Hornby)。ALPS:用于缓解早熟收敛问题的年龄分层种群结构。收录于《第 8 届遗传与进化计算年会论文集》,GECCO '06,第 815-822 页,美国纽约州纽约市,2006 年。ACM 出版。ISBN 1-59593-186-4。doi: 10.1145/1143997.1144142。

[51] M.D. Schmidt and H. Lipson. Coevolution of Fitness Predictors. IEEE Transactions on Evolutionary Computation, 12(6):736-749, December 2008. ISSN 1941-0026, 1089-778X. doi: 10.1109/TEVC.2008. 919006.
[51] M.D.施密特与 H.利普森合著。适应度预测因子的协同进化。《IEEE 进化计算汇刊》,第 12 卷第 6 期,第 736-749 页,2008 年 12 月。ISSN 1941-0026, 1089-778X。doi: 10.1109/TEVC.2008.919006。

[52] Raja Muhammad Atif Azad. Krzysztof Krawiec: Behavioral program synthesis with genetic programming. Genetic Programming and Evolvable Machines, 18(1):111-113, March 2017. ISSN 1389-2576, 1573-7632. doi: 10.1007/s10710-016-9283-7.
[52] Raja Muhammad Atif Azad。Krzysztof Krawiec:基于遗传编程的行为程序合成。《遗传编程与可进化机器》,第 18 卷第 1 期,第 111-113 页,2017 年 3 月。ISSN 1389-2576, 1573-7632。doi: 10.1007/s10710-016-9283-7。

[53] Bartosz Wieloch and Krzysztof Krawiec. Running programs backwards: Instruction inversion for effective search in semantic spaces. In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pages 1013-1020, 2013.
[53] Bartosz Wieloch 与 Krzysztof Krawiec 合著。逆向运行程序:语义空间高效搜索的指令反转技术。载于《第 15 届遗传与进化计算年会论文集》,第 1013-1020 页,2013 年。

[54] Krzysztof Krawiec and Tomasz Pawlak. Approximating geometric crossover by semantic backpropagation. In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pages 941-948, 2013.
[54] Krzysztof Krawiec 与 Tomasz Pawlak 合著。通过语义反向传播逼近几何交叉操作。载于《第 15 届遗传与进化计算年会论文集》,第 941-948 页,2013 年。

[55] Tomasz P Pawlak, Bartosz Wieloch, and Krzysztof Krawiec. Semantic backpropagation for designing search operators in genetic programming. IEEE Transactions on Evolutionary Computation, 19(3):326-340, 2014.
[55] 托马什·P·帕夫拉克、巴托斯·维洛克与克日什托夫·克拉维茨。遗传编程中搜索算子设计的语义反向传播。《IEEE 进化计算汇刊》,19(3):326-340,2014 年。

[56] Alexander Topchy and William F. Punch. Faster genetic programming based on local gradient search of numeric leaf values. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO- 2001), pages 155-162, 2001.
[56] 亚历山大·托普奇与威廉·F·庞奇。基于叶节点数值局部梯度搜索的快速遗传编程。载《遗传与进化计算会议论文集》(GECCO-2001),第 155-162 页,2001 年。

[57] J.C. Bongard and H. Lipson. Nonlinear System Identification Using Coevolution of Models and Tests. IEEE Transactions on Evolutionary Computation, 9(4):361-384, August 2005. ISSN 1089-778X. doi: 10.1109/TEVC.2005.850293.
[57] J·C·邦加德与 H·利普森。基于模型与测试协同进化的非线性系统辨识。《IEEE 进化计算汇刊》,9(4):361-384,2005 年 8 月。ISSN 1089-778X。doi: 10.1109/TEVC.2005.850293。

[58] Michael Kommenda, Gabriel Kronberger, Stephan Winkler, Michael Affenzeller, and Stefan Wagner. Effects of constant optimization by nonlinear least squares minimization in symbolic regression. In Christian Blum, Enrique Alba, Thomas Bartz-Beielstein, Daniele Loiacono, Francisco Luna, Joern Mehnen, Gabriela Ochoa, Mike Preuss, Emilia Tantar, and Leonardo Vanneschi, editors, GECCO '13 Companion: Proceeding of the Fifteenth Annual Conference Companion on Genetic and Evolutionary Computation Conference Companion, pages 1121-1128, Amsterdam, The Netherlands, 6. ACM. doi: doi:10.1145/ 2464576.2482691.
[58] Michael Kommenda、Gabriel Kronberger、Stephan Winkler、Michael Affenzeller 与 Stefan Wagner 合著。符号回归中通过非线性最小二乘优化常数的影响。收录于 Christian Blum、Enrique Alba、Thomas Bartz-Beielstein、Daniele Loiacono、Francisco Luna、Joern Mehnen、Gabriela Ochoa、Mike Preuss、Emilia Tantar 及 Leonardo Vanneschi 编辑的《GECCO '13 Companion:第十五届遗传与进化计算会议伴侣论文集》,第 1121-1128 页,荷兰阿姆斯特丹,6 月。ACM 出版社。doi: doi:10.1145/2464576.2482691。


[59] Bogdan Burlacu, Gabriel Kronberger, and Michael Kommenda. Operon C++ an efficient genetic programming framework for symbolic regression. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion, pages 1562-1570, 2020.
[59] Bogdan Burlacu、Gabriel Kronberger 与 Michael Kommenda 合著。Operon C++:一种高效的符号回归遗传编程框架。收录于《2020 年遗传与进化计算会议伴侣论文集》,第 1562-1570 页,2020 年。

[60] Fabrício Olivetti de França. A greedy search tree heuristic for symbolic regression. Information Sciences, 442:18-32, 2018.
[60] Fabrício Olivetti de França 著。符号回归的贪心搜索树启发式方法。《信息科学》,第 442 卷,第 18-32 页,2018 年。

[61] Marco Virgolin, Tanja Alderliesten, Cees Witteveen, and Peter A N Bosman. Scalable genetic programming by gene-pool optimal mixing and input-space entropy-based building-block learning. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 1041-1048, 2017.
[61] Marco Virgolin, Tanja Alderliesten, Cees Witteveen, Peter A N Bosman. 基于基因池最优混合与输入空间熵的构建块学习的可扩展遗传规划。载于《遗传与进化计算会议论文集》,第 1041-1048 页,2017 年。

[62] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229-256, 1992.
[62] Ronald J. Williams. 连接主义强化学习的简单统计梯度跟随算法。《机器学习》,8(3-4):229-256,1992 年。

[63] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked Science in Machine Learning. SIGKDD Explorations, 15(2):49-60, 2013. doi: 10.1145/2641190.2641198.
[63] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, Luis Torgo. OpenML:机器学习中的网络化科学。《SIGKDD 探索》,15(2):49-60,2013 年。doi: 10.1145/2641190.2641198。

[64] M. Lichman. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, 2013.
[64] M. Lichman. UCI 机器学习数据库。加州大学欧文分校信息与计算机科学学院,2013 年。

[65] Richard P. Feynman, Robert B. Leighton, and Matthew Sands. The Feynman Lectures on Physics, Vol. I: The New Millennium Edition: Mainly Mechanics, Radiation, and Heat. Basic Books, September 2015. ISBN 978-0-465-04085-8.
[65] 理查德·P·费曼、罗伯特·B·莱顿、马修·桑兹。《费曼物理学讲义,第一卷:新千年版:主要涵盖力学、辐射与热学》。基础图书出版社,2015 年 9 月。ISBN 978-0-465-04085-8。

[66] Steven H Strogatz. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering. Westview press, 2014.
[66] 史蒂文·H·斯特罗加茨。《非线性动力学与混沌:在物理、生物、化学及工程中的应用》。Westview 出版社,2014 年。

[67] William La Cava, Kourosh Danai, and Lee Spector. Inference of compact nonlinear dynamic models by epigenetic local search. Engineering Applications of Artificial Intelligence, 55:292-306, October 2016. ISSN 0952-1976. doi: 10.1016/j.engappai.2016.07.004.
[67] 威廉·拉卡瓦、库罗什·达奈、李·斯佩克特。通过表观遗传局部搜索推断紧凑非线性动态模型。《工程人工智能应用》,55 卷,292-306 页,2016 年 10 月。ISSN 0952-1976。doi: 10.1016/j.engappai.2016.07.004。

[68] E.J. Vladislavleva, G.F. Smits, and D. den Hertog. Order of Nonlinearity as a Complexity Measure for Models Generated by Symbolic Regression via Pareto Genetic Programming. IEEE Transactions on Evolutionary Computation, 13(2):333-349, 2009. ISSN 1089-778X. doi: 10.1109/TEVC.2008.926486.
[68] E·J·弗拉迪斯拉夫列娃、G·F·斯米茨、D·登赫托格。非线性阶数作为通过帕累托遗传编程生成符号回归模型的复杂度度量。《IEEE 进化计算汇刊》,13 卷 2 期,333-349 页,2009 年。ISSN 1089-778X。doi: 10.1109/TEVC.2008.926486。

[69] Michael Kommenda, Gabriel Kronberger, Michael Affenzeller, Stephan M. Winkler, and Bogdan Burlacu. Evolving Simple Symbolic Regression Models by Multi-objective Genetic Programming. In Genetic Programming Theory and Practice, volume XIV of Genetic and Evolutionary Computation. Springer, Ann Arbor, MI, 2015.
[69] Michael Kommenda, Gabriel Kronberger, Michael Affenzeller, Stephan M. Winkler 与 Bogdan Burlacu 合著。《通过多目标遗传规划演化简单符号回归模型》。收录于《遗传规划理论与实践》第 XIV 卷《遗传与进化计算》。Springer 出版社,美国密歇根州安娜堡,2015 年。

[70] Marco Virgolin, Andrea De Lorenzo, Eric Medvet, and Francesca Randone. Learning a formula of interpretability to learn interpretable formulas. In International Conference on Parallel Problem Solving from Nature, pages 79-93. Springer, 2020.
[70] Marco Virgolin, Andrea De Lorenzo, Eric Medvet 与 Francesca Randone 合著。《学习可解释性公式以推导可解释公式》。发表于《自然并行问题求解国际会议》,第 79-93 页。Springer 出版社,2020 年。

[71] W. James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu. Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences, 116 (44):22071-22080, 10 2019-10-29. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1900654116.
[71] W. James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl 与 Bin Yu 合著。《可解释机器学习的定义、方法及应用》。《美国国家科学院院刊》第 116 卷第 44 期,22071-22080 页,2019 年 10 月 29 日。ISSN 0027-8424, 1091-6490。doi: 10.1073/pnas.1900654116。

[72] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd Acm Sigkdd International Conference on Knowledge Discovery and Data Mining, pages 785-794. ACM, 2016.
[72] 陈天奇与 Carlos Guestrin 合著。《XGBoost:一种可扩展的树提升系统》。发表于《第 22 届 ACM SIGKDD 知识发现与数据挖掘国际会议论文集》,第 785-794 页。ACM 出版社,2016 年。

[73] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30:3146-3154, 2017.
[73] 柯国霖、孟琪、Thomas Finley、王太峰、陈伟、马卫东、叶启威、刘铁岩。LightGBM:一种高效的梯度提升决策树。《神经信息处理系统进展》,30:3146-3154,2017 年。

[74] Leo Breiman. Random forests. Machine learning, 45(1):5-32, 2001.
[74] Leo Breiman。随机森林。《机器学习》,45(1):5-32,2001 年。


[75] Robert E. Schapire. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification, pages 149-171. Springer, 2003.
[75] Robert E. Schapire。机器学习中的提升方法:概述。《非线性估计与分类》,149-171 页。Springer,2003 年。

[76] Michael D Schmidt, Ravishankar R Vallabhajosyula, Jerry W Jenkins, Jonathan E Hood, Abhishek S Soni, John P Wikswo, and Hod Lipson. Automated refinement and inference of analytical models for metabolic networks. Physical Biology, 8(5):055011, October 2011. ISSN 1478-3975. doi: 10.1088/1478-3975/8/5/ 055011.
[76] Michael D Schmidt、Ravishankar R Vallabhajosyula、Jerry W Jenkins、Jonathan E Hood、Abhishek S Soni、John P Wikswo、Hod Lipson。代谢网络分析模型的自动化优化与推断。《物理生物学》,8(5):055011,2011 年 10 月。ISSN 1478-3975。doi: 10.1088/1478-3975/8/5/055011。

[77] Michael Schmidt and Hod Lipson. Comparison of Tree and Graph Encodings As Function of Problem Complexity. In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, GECCO '07, pages 1674-1679, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-697-4. doi: 10.1145/1276958.1277288.
[77] Michael Schmidt 与 Hod Lipson。问题复杂度视角下树与图编码方式的比较。载于《第 9 届遗传与进化计算年会论文集》(GECCO '07),第 1674-1679 页,美国纽约州纽约市,2007 年。ACM 出版社。ISBN 978-1-59593-697-4。doi: 10.1145/1276958.1277288。

[78] Grant Dick, Caitlin A. Owen, and Peter A. Whigham. Feature standardisation and coefficient optimisation for effective symbolic regression. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, GECCO '20, pages 306-314, Cancún, Mexico, June 2020. Association for Computing Machinery. ISBN 978-1-4503-7128-5. doi: 10.1145/3377930.3390237.
[78] Grant Dick、Caitlin A. Owen 及 Peter A. Whigham。符号回归中特征标准化与系数优化的有效性研究。载于《2020 年遗传与进化计算会议论文集》(GECCO '20),第 306-314 页,墨西哥坎昆市,2020 年 6 月。美国计算机协会。ISBN 978-1-4503-7128-5。doi: 10.1145/3377930.3390237。

[79] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. arXiv:1711.05144 [cs], December 2018.
[79] Michael Kearns、Seth Neel、Aaron Roth 与 Zhiwei Steven Wu。防止公平性选区划分操纵:子群公平性审计与学习。arXiv:1711.05144 [cs],2018 年 12 月。

[80] William La Cava and Jason H. Moore. Genetic programming approaches to learning fair classifiers. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, GECCO '20, 2020. doi: 10.1145/3377930.3390157.
[80] William La Cava 与 Jason H. Moore。学习公平分类器的遗传编程方法。载于《2020 年遗传与进化计算会议论文集》(GECCO '20),2020 年。doi: 10.1145/3377930.3390157。

[81] Jerome H Friedman. Greedy function approximation: A gradient boosting machine. Annals of statistics, pages 1189-1232, 2001.
[81] Jerome H Friedman. 贪婪函数逼近法:梯度提升机。统计年鉴,第 1189-1232 页,2001 年。

[82] Janez Demšar. Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7(Jan):1-30, 2006. ISSN ISSN 1533-7928.
[82] Janez Demšar。多数据集分类器的统计比较。机器学习研究期刊,7(1 月刊):1-30,2006 年。ISSN 1533-7928。

Checklist


1. For all authors...

(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] The results can be verified by visiting our repository. Specific claims are supported by statistical tests.

(b) Did you describe the limitations of your work? [Yes] See discussion and conclusions.

(c) Did you discuss any potential negative societal impacts of your work? [Yes] See Sec. A.4.

(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] In addition to releasing the benchmark in a transparent way, we discuss ethics in Sec. A.4.

2. If you ran experiments (e.g. for benchmarks)...

(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See https://github.com/EpistasisLab/srbench.

(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Table 2.

(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Figs. 1-3 for example.

(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix.

3. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...

(a) If your work uses existing assets, did you cite the creators? [Yes]

(b) Did you mention the license of the assets? [Yes]

(c) Did you include any new assets either in the supplemental material or as a URL? [Yes]

(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [Yes] Datasets are released under an MIT license.

(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] These datasets do not contain personally identifiable information.


A Appendix


Please refer to https://github.com/EpistasisLab/srbench/ for the most up-to-date guide to SRBench.

A.1 Running the Benchmark


The README in our GitHub repository includes the set of commands needed to reproduce the benchmark experiment; we summarize them here. Experiments are launched from the experiments/ folder via the script analyze.py. The script can be configured to run the experiment in parallel locally, on an LSF job scheduler, or on a SLURM job scheduler. To see the full set of options, run python analyze.py -h.

After installing and configuring the conda environment, the complete black-box experiment can be started via the command:



python analyze.py /path/to/pmlb/datasets -n_trials 10 \
    -results ../results -time_limit 48:00



Similarly, the ground-truth regression experiment for the Strogatz datasets with a target noise of 0.0 is run by the command:



python analyze.py -results ../results_sym_data -target_noise 0.0 \
    "/path/to/pmlb/datasets/strogatz*" -sym_data -n_trials 10 \
    -time_limit 9:00 -tuned



A.2 Contributing a Method


A living version of the method contribution instructions is maintained in the Contribution Guide. To illustrate the simplicity of contributing a method, Figure 4 shows the script submitted for Bayesian Symbolic Regression [16]. In addition to the code snippet, authors may either add their code package to the conda/pip environment, or provide an install script. When a pull request is issued by a contributor, new methods and installs are automatically tested on a minimal version of the benchmark. Once the tests pass and the method is approved by the benchmark maintainers, the contribution becomes part of the resource and can be tested via the commands above.

A.3 Additional Background and Motivation


Eureqa Eureqa is a commercial GP-based SR software that was acquired by DataRobot in 2017 (https://www.datarobot.com/nutonian/). Due to its closed-source nature and incorporation into the DataRobot platform, it is impossible to benchmark its performance while controlling for important experimental variables such as number of evaluations, space and time limits, population size, and so forth. However, the novel algorithmic aspects of Eureqa are rooted in a number of ablation studies [38, 51, 77] that we summarize here. First is its use of directed acyclic graphs for representing equations in lieu of trees, which resulted in more space-efficient model encoding relative to trees, without a significant difference in accuracy [77]. The most significant improvement over traditional tournament-based selection is Eureqa's use of age-fitness Pareto optimization (AFP), a method in which random restarts are incorporated each generation as new offspring, and are protected from competing with older, more fit equations by including age as an objective to be minimized [38]. Eureqa also includes the co-evolution of fitness predictors, in which fitness assignment is sped up by optimizing a second population of training sample indices that best distinguish between equations in the population [51]. Unfortunately, we cannot guarantee that Eureqa currently uses any of these reported algorithms for SR, due to its closed-source nature. We chose instead to benchmark known algorithms (AFP, AFP_FE) with open-source implementations, hoping that the resulting study's conclusions may better inform future methods development. We note that AFP has been outperformed by a number of other optimization methods in controlled studies since its release (e.g., [27, 28]).
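To make the AFP selection rule concrete, the following minimal sketch (our own illustration, not code from Eureqa or from the benchmarked implementations) keeps the Pareto non-dominated set over the two objectives of age and error, so that newly injected random restarts survive alongside older, fitter equations:

import random

def dominates(a, b):
    # a dominates b if it is no worse on both objectives (age, error)
    # and strictly better on at least one.
    return (a["age"] <= b["age"] and a["error"] <= b["error"]
            and (a["age"] < b["age"] or a["error"] < b["error"]))

def afp_survivors(population):
    # Keep every individual not dominated by another; young random
    # restarts are thereby shielded from older, fitter equations.
    return [p for p in population
            if not any(dominates(q, p) for q in population if q is not p)]

population = [{"age": random.randint(1, 10), "error": random.random()}
              for _ in range(20)]
population.append({"age": 0, "error": random.random()})  # fresh restart
print(afp_survivors(population))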




# method: Bayesian Symbolic Regression
# contributor: Ying Jin
# source: https://github.com/ying531/MCMC-SymReg
from bsr.bsr_class import BSR

# hyperparameter space: six combinations of sample size (val),
# MCMC iterations (itrNum), and number of trees (treeNum)
hyper_params = []
for val, itrNum in zip([100, 500, 1000], [5000, 1000, 500]):
    for treeNum in [3, 6]:
        hyper_params.append({
            'treeNum': [treeNum],
            'itrNum': [itrNum],
            'val': [val],
        })

# default estimator
est = BSR(val=100, itrNum=5000, treeNum=3, alpha1=0.4, alpha2=0.4,
          beta=-1, disp=False, max_time=2 * 60 * 60)

def complexity(est):
    """Returns the final model complexity."""
    return est.complexity()

def model(est):
    """Returns the final model as a string."""
    return est.model()



Figure 4: An example code contribution, defining the estimator, its hyperparameters, and functions to return the complexity and symbolic model.


Constant optimization in Genetic Programming One of the clearest improvements over Koza-style GP has been the adoption of local search methods to handle constant optimization distinctly from evolutionary learning. Several reasons can explain why backpropagation and gradient descent are relatively under-used for constant optimization in GP (compared to, e.g., evolutionary neural architecture search). For example, early works often ignored feature standardization (e.g., by z-scoring), the lack of which can harm gradient propagation [78]. In addition, GP relies on crafting compositions out of a multitude of operations, some of which are prone to cause vanishing or exploding gradients. Last but not least, to the best of our knowledge, the field lacks a comprehensive study that provides guidelines for the appropriate hyperparameters for constant optimization (learning rate schedule, iterations, batch size, etc.), and for how to effectively balance parameter learning with the evolutionary process.
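As a minimal sketch of what such a local search step looks like (an illustration under assumed data and an assumed expression structure, not the routine of any particular benchmarked method), nonlinear least squares can tune the constants of a fixed expression while evolution handles its structure:

import numpy as np
from scipy.optimize import curve_fit

def expr(x, c0, c1, c2):
    # Fixed structure, e.g. proposed by evolution; only the constants
    # c0, c1, c2 are refined by local search.
    return c0 * x * np.sin(c1 * x) + c2

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 200)
y = 1.5 * x * np.sin(2.0 * x) + 0.3 + rng.normal(0, 0.05, x.size)

# Nonlinear least squares refines the constants from an initial guess.
(c0, c1, c2), _ = curve_fit(expr, x, y, p0=[1.0, 1.0, 0.0])
print(c0, c1, c2)  # approximately 1.5, 2.0, 0.3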

A.4 Additional Dataset Information


All datasets, including metadata, are available from PMLB. Each dataset is stored using Git Large File Storage, and PMLB is planned for long-term maintenance. PMLB is available under an MIT license, and is described in detail in Romano et al. [36]. The authors bear all responsibility in case of violation of rights.

Figure 5: Distribution of dataset sizes in PMLB.

Dataset Properties The distribution of dataset sizes by samples and features is shown in Fig. 5. Datasets vary in size from tens to millions of samples, and up to thousands of features. The datasets can be navigated and inspected in the repository documentation.

Ethical Considerations and Intended Uses PMLB is intended to be used as a framework for benchmarking ML and SR algorithms and as a resource for investigating the structure of datasets. This paper does not contribute new datasets, but rather collates and standardizes datasets that were already publicly available. In that regard, we do not foresee SRBench as creating additional ethical issues around their use. Nevertheless, it is worth noting that PMLB contains well-known, real-world datasets from UCI and OpenML for which ethical considerations are important, such as the USCrime dataset. While we would consider the risk of harm arising specifically from this dataset to be low (the data is from 1960), it is exemplary of a task for which algorithmic decision making could exacerbate existing biases in the criminal justice system. As such, it is used as a benchmark in a number of papers in the ML fairness literature (e.g. [79, 80]). None of the datasets herein contain personally identifiable information.

Feynman datasets The Feynman benchmarks were sourced from the Feynman Symbolic Regression Database. We standardized the Feynman and Bonus equations to PMLB format and included metadata detailing the model form and the units for each variable. We used the version of the equations that were not simplified by dimensional analysis. Udrescu and Tegmark [18] describe each dataset as containing $10^5$ rows, but each actually contains $10^6$. Given this discrepancy, and after noting that sub-sampling did not significantly change the correlation structure of any of the problems, each dataset was down-sampled from one million samples to 100,000 to lower the computational burden. We also observed that Eqn. II.11.17 was missing from the database. Finally, we excluded three datasets from our analysis that contained arcsin and arccos functions, as these were not implemented in the majority of the SR algorithms we tested.


Strogatz datasets The Strogatz datasets were sourced from the ODE-Strogatz repository [67]. Each dataset is one state of a 2-state system of first-order ordinary differential equations (ODEs). The goal of each problem is to predict the rate of change of that state given the current values of the two states on which it depends. Each system represents a natural process that exhibits chaos and non-linear dynamics. The problems were originally adapted from [66] by Schmidt [25]. In order to simulate their behavior, initial conditions were chosen within stable basins of attraction. Each system was simulated using Simulink, and the simulation code is available in the repository above. The equations for each of these datasets are shown in Table 3; a minimal simulation sketch follows the table.


Table 3: The Strogatz ODE problems.

- Bacterial Respiration: $\dot{x} = 20 - x - \frac{xy}{1 + 0.5x^2}$; $\dot{y} = 10 - \frac{xy}{1 + 0.5x^2}$
- Bar Magnets: $\dot{\theta} = 0.5\sin(\theta - \phi) - \sin(\theta)$; $\dot{\phi} = 0.5\sin(\phi - \theta) - \sin(\phi)$
- Glider: $\dot{v} = -0.05v^2 - \sin(\theta)$; $\dot{\theta} = v - \cos(\theta)/v$
- Lotka-Volterra interspecies dynamics: $\dot{x} = 3x - 2xy - x^2$; $\dot{y} = 2y - xy - y^2$
- Predator Prey: $\dot{x} = x\left(4 - x - \frac{y}{1+x}\right)$; $\dot{y} = y\left(\frac{x}{1+x} - 0.075y\right)$
- Shear Flow: $\dot{\theta} = \cot(\phi)\cos(\theta)$; $\dot{\phi} = \left(\cos^2(\phi) + 0.1\sin^2(\phi)\right)\sin(\theta)$
- van der Pol oscillator: $\dot{x} = 10\left(y - \frac{1}{3}(x^3 - x)\right)$; $\dot{y} = -\frac{1}{10}x$
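As a minimal sketch of how such datasets arise (with an illustrative initial condition and time grid, and scipy standing in for the Simulink simulations used originally), one can integrate a system from Table 3 and pair the observed states with the rate of change of one state:

import numpy as np
from scipy.integrate import solve_ivp

def van_der_pol(t, state):
    # van der Pol oscillator from Table 3.
    x, y = state
    return [10.0 * (y - (x**3 - x) / 3.0), -x / 10.0]

t = np.linspace(0, 20, 500)
sol = solve_ivp(van_der_pol, (0, 20), [1.0, 0.0], t_eval=t)

# Features: the current states (x, y); target: the rate of change of x.
X = sol.y.T
y_target = np.array([van_der_pol(ti, s)[0] for ti, s in zip(t, X)])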


Adding Noise White Gaussian noise was added to the target as a fraction of the signal root mean square value. In other words, for target noise level $\gamma$,

$$y_{\text{noise}} = y + \epsilon, \qquad \epsilon \sim \mathcal{N}\!\left(0,\ \gamma \sqrt{\tfrac{1}{N}\sum_i y_i^2}\right)$$
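A minimal sketch of this noise model, assuming numpy (variable names are illustrative):

import numpy as np

def add_target_noise(y, gamma, rng=None):
    # Gaussian noise whose standard deviation is a fraction (gamma)
    # of the signal's root mean square value.
    rng = rng or np.random.default_rng()
    rms = np.sqrt(np.mean(y ** 2))
    return y + rng.normal(0.0, gamma * rms, size=y.shape)

y = np.sin(np.linspace(0, 6, 100))
y_noisy = add_target_noise(y, gamma=0.01)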

A.5 Additional Experiment Details


Experiments were run in a heterogeneous cluster computing environment composed of hosts with 24-28 core Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz processors and 250 GB of RAM. Jobs consisted of the training of each method on a single dataset for a fixed random seed. Each job received one CPU core and up to 16 GB of RAM, and was time-limited as shown in Table 2. For the ground-truth problems, the final models from each method were given an additional hour of computing time with 8 GB of RAM to be simplified with sympy and assessed by the solution criteria (see Def. 4.1). For the black-box problems, if a job was killed due to the time limit, we re-ran the experiment without hyperparameter tuning, thereby only requiring a single training iteration to complete within 48 hours. To ease the computational burden for large datasets, training data exceeding 10,000 samples was randomly subset to 10,000 rows; test set predictions were still evaluated over the entire test fold.
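As a sketch of the symbolic-equivalence check behind the solution criteria (our illustration of the idea in Def. 4.1, with illustrative expressions rather than benchmark output), sympy can test whether a candidate model matches a ground-truth equation up to simplification:

import sympy as sp

x, y = sp.symbols("x y")
ground_truth = 2 * y - x * y - y**2
candidate = y * (2 - x - y)          # a model returned by an SR method

# The difference simplifies to zero iff the expressions are equivalent.
diff = sp.simplify(candidate - ground_truth)
print(diff == 0)  # True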

The hyperparameter settings for each method are shown in Tables 4-6. Each SR method was tuned over a set of six hyperparameter combinations. The most common parameter setting chosen during the black-box regression experiments was then used as the "tuned" version of each algorithm for the ground-truth problems, with updates to 1) include any mathematical operators needed for those problems and 2) double the evaluation budget.



Table 4: ML methods and the hyperparameter spaces used in tuning.

- AdaBoost: {'learning_rate': (0.01, 0.1, 1.0, 10.0), 'n_estimators': (10, 100, 1000)}
- KernelRidge: {'kernel': ('linear', 'poly', 'rbf', 'sigmoid'), 'alpha': (0.0001, 0.01, 0.1, 1), 'gamma': (0.01, 0.1, 1, 10)}
- LassoLars: {'alpha': (0.0001, 0.001, 0.01, 0.1, 1)}
- LGBM: {'n_estimators': (10, 50, 100, 250, 500, 1000), 'learning_rate': (0.0001, 0.01, 0.05, 0.1, 0.2), 'subsample': (0.5, 0.75, 1), 'boosting_type': ('gbdt', 'dart', 'goss')}
- LinearRegression: {'fit_intercept': (True,)}
- MLP: {'activation': ('logistic', 'tanh', 'relu'), 'solver': ('lbfgs', 'adam', 'sgd'), 'learning_rate': ('constant', 'invscaling', 'adaptive')}
- RandomForest: {'n_estimators': (10, 100, 1000), 'min_weight_fraction_leaf': (0.0, 0.25, 0.5), 'max_features': ('sqrt', 'log2', None)}
- SGD: {'alpha': (1e-06, 0.0001, 0.01, 1), 'penalty': ('l2', 'l1', 'elasticnet')}
- XGB: {'n_estimators': (10, 50, 100, 250, 500, 1000), 'learning_rate': (0.0001, 0.01, 0.05, 0.1, 0.2), 'gamma': (0, 0.1, 0.2, 0.3, 0.4), 'subsample': (0.5, 0.75, 1)}


A.6 Additional Results


A.6.1 Subgroup analysis of black-box regression results


Many of the black-box regression problems in PMLB were originally sourced from OpenML. A few authors have noted that several of these datasets derive from Friedman's synthetic benchmarks [81]. These datasets are generated by non-linear functions that vary in degree of noise, variable interactions, variable importance, and degree of non-linearity. Due to their number, they may have an outsized effect on results reported on PMLB. In Fig. 6, we separate out results on this set of problems relative to the rest of PMLB. We find that the results on the Friedman datasets distinguish top-ranked methods more strongly than the rest of the benchmark, on which performance between top-performing methods is more similar. In general, although method rankings change somewhat when looking at specific data groupings, we do not observe large differences. An exception is kernel ridge regression, which performs poorly on the Friedman datasets but very well on the rest of PMLB. We recommend that future revisions of PMLB expand the dataset collection to minimize the effect of any one source of data, and include subgroup analyses to identify which types of problems are best solved by specific methods.

To get a better sense of the performance variability across methods and datasets, method rankings on each dataset are bi-clustered and visualized in Fig. 7. Methods that perform most similarly across the benchmark are placed adjacent to each other, and likewise datasets that induce similar method rankings are grouped. We note some expected groupings first: AFP and AFP_FE, which differ only in fitness estimation, and FEAT and EPLEX, which use the same selection method, perform similarly. We also observe clustering among the Friedman datasets (names beginning with "fri_"), and again note stark differences between methods that perform well on these problems, e.g. Operon, SBP-GP, and FEAT, and those that do not, e.g. MLP. This view of the results also reveals a cluster of SR methods (AFP, AFP_FE, DSR, gplearn) that perform well on a subset of real-world problems (analcatdata_neavote_523 - vineyard_192) for which linear models also perform well. Interestingly, for that problem subset, Operon's performance is mediocre relative to its strong performance on other datasets. We also note with surprise that DSR and gplearn exhibit performance similarity on par with AFP/AFP_FE, and are the next most similar-performing methods (note the dendrogram connecting these columns).
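As a sketch of this kind of bi-clustered view (whether Figure 7 was produced with seaborn is our assumption, and the ranking matrix below is random stand-in data), seaborn's clustermap reorders rows and columns by hierarchical clustering:

import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
ranks = pd.DataFrame(rng.integers(1, 15, size=(30, 6)),
                     columns=[f"method_{i}" for i in range(6)])

# Bi-cluster: rows (datasets) and columns (methods) are each reordered
# by hierarchical clustering of their ranking profiles.
grid = sns.clustermap(ranks, cmap="viridis")
grid.savefig("rankings_clustermap.png")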



Table 5: Part 1: SR methods and the hyperparameter spaces used in tuning on the black-box regression problems.

- AFP: six combinations of {'popsize', 'g'} ∈ {(100, 2500), (500, 500), (1000, 250)}, each with 'op_list' = ['n', 'v', '+', '-', '*', '/', 'exp', 'log', '2', '3', 'sqrt'], with and without 'sin' and 'cos'.
- AIFeynman: {'BF_try_time': 60, 'NN_epochs': 4000} and {'BF_try_time': 600, 'NN_epochs': 400}, each with 'BF_ops_file_type' ∈ {"10ops.txt", "14ops.txt", "19ops.txt"}.
- BSR: 'treeNum' ∈ {3, 6} crossed with {'itrNum', 'val'} ∈ {(500, 1000), (1000, 500), (5000, 100)}.
- DSR: {'batch_size': (10, 100, 1000, 10000, 100000)}.
- EPLEX: six combinations of {'popsize', 'g'} ∈ {(100, 2500), (500, 500), (1000, 250)}, each with 'op_list' = ['n', 'v', '+', '-', '*', '/', 'exp', 'log', '2', '3', 'sqrt'], with and without 'sin' and 'cos'.
- FEAT: {'pop_size', 'gens'} ∈ {(100, 2500), (500, 500), (1000, 250)} crossed with 'lr' ∈ {0.1, 0.3}.
- FE_AFP: six combinations of {'popsize', 'g'} ∈ {(100, 2500), (500, 500), (1000, 250)}, each with 'op_list' = ['n', 'v', '+', '-', '*', '/', 'exp', 'log', '2', '3', 'sqrt'], with and without 'sin' and 'cos'.



A.6.2 Extended analysis of ground-truth regression results


As noted in Sec. 6, despite Operon's good performance on black-box regression, it finds few models with symbolic equivalence. An alternative (and weaker) notion of solution is based on test set accuracy, which we show in Fig. 8; by this metric, the relative method performance corresponds more closely to that seen for black-box regression. We also note that methods that impose structural assumptions on the model (BSR, FEAT, ITEA, FFX) are worse at finding symbolic solutions, most of which do not match those assumptions (e.g. most processes in Table 3).



Table 6: Part 2: SR methods and the hyperparameter spaces used in tuning on the black-box regression problems.

- GPGOMEA: 'initmaxtreeheight' ∈ {4, 6} crossed with ('functions', 'linearscaling') ∈ {('+_-_*_p/_plog_sqrt_sin_cos', True), ('+_-_*_p/', True), ('+_-_*_p/_plog_sqrt_sin_cos', False)}; 'popsize' = 1000 throughout.
- ITEA: {'exponents': (-5, 5)} with ('termlimit', 'transfunctions') ∈ {((2, 15), full), ((2, 5), full), ((2, 15), [Id, Sin])}, and {'exponents': (0, 5)} with ('termlimit', 'transfunctions') ∈ {((2, 15), [Id, Sin]), ((2, 5), [Id, Sin]), ((2, 15), full)}, where full = [Id, Tanh, Sin, Cos, Log, Exp, SqrtAbs].
- MRGP: {'popsize', 'g'} ∈ {(100, 2500), (500, 500), (1000, 250)} crossed with ('rt_cross', 'rt_mut') ∈ {(0.8, 0.2), (0.2, 0.8)}.
- Operon: 'population_size' = 'pool_size' ∈ {500 (with 'tournament_size' 5), 100 (with 'tournament_size' 3)}, crossed with ('max_length', 'allowed_symbols') ∈ {(50, 'add,mul,aq,constant,variable'), (25, 'add,mul,aq,exp,log,sin,tanh,constant,variable'), (25, 'add,mul,aq,constant,variable')}; all settings use 'local_iterations' = 5, 'offspring_generator' = 'basic', 'reinserter' = 'keep-best', and 'max_evaluations' = 500000.
- gplearn: {'population_size', 'generations'} ∈ {(100, 5000), (500, 1000), (1000, 500)}, each with 'function_set' = ('add', 'sub', 'mul', 'div', 'log', 'sqrt'), with and without ('sin', 'cos').
- sembackpropgp: all settings use 'functions' = '+_-_*_aq_plog_sin_cos', 'sbrdo' = 0.9, and 'submut' = 0.1; the six combinations are {'popsize' 1000, 'linearscaling' False, 'tournament' 4, 'maxsize' 250, 'sblibtype' 'p_6_9999'}, {'popsize' 10000, 'linearscaling' False, 'tournament' 8, 'maxsize' 250, 'sblibtype' 'p_6_9999'}, and {'popsize' 1000, 'linearscaling' True} with ('tournament', 'maxsize') ∈ {(4, 1000), (8, 1000), (4, 5000), (8, 5000)}.



Figure 6: Comparison of normalized $R^2$ test scores on all black-box datasets, just the Friedman datasets, and just the non-Friedman datasets.


A.7 Statistical Tests


Figures 9-11 give summary significance levels of pairwise tests of significance between estimators on the black-box and ground-truth problems. All pairwise statistical tests are Wilcoxon signed-rank tests. A Bonferroni correction was applied, yielding the $\alpha$ levels given in each figure. This methodology for assessing statistical significance is based on the recommendations of Demšar [82] for comparing multiple estimators over many datasets. These figures are intended to complement Figures 1-3, in which effect sizes are shown.
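As an illustration of this testing procedure (a sketch with stand-in scores, not the exact analysis script), pairwise Wilcoxon signed-rank tests with a Bonferroni-adjusted threshold can be run as follows:

from itertools import combinations
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
scores = rng.random((20, 4))        # stand-in per-dataset R^2 scores
methods = ["A", "B", "C", "D"]

pairs = list(combinations(range(len(methods)), 2))
alpha = 0.05 / len(pairs)           # Bonferroni-adjusted threshold

for i, j in pairs:
    stat, p = wilcoxon(scores[:, i], scores[:, j])
    print(f"{methods[i]} vs {methods[j]}: p={p:.4f}, "
          f"significant={p < alpha}")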




Figure 7: Rankings of methods by $R^2$ test score on the black-box problems (lower is better). Results are bi-clustered by SR method (columns) and dataset (rows). Darker cells indicate that a method performs well on that dataset relative to its competitors. Note that only a subset of the datasets are labelled due to space constraints.



Figure 8: Subset comparison of "Accuracy Solutions", i.e. models with $R^2 > 0.999$ on the Feynman and Strogatz problems, differentiated by noise level.


Figure 9: Pairwise statistical comparisons on the black-box regression problems. Wilcoxon signed-rank tests are used with a Bonferroni correction on $\alpha$ for multiple comparisons. (Left) $R^2$ test scores, (right) model size.



Figure 10: Pairwise statistical comparisons of $R^2$ test scores on the ground-truth regression problems. We report Wilcoxon signed-rank tests with a Bonferroni correction on $\alpha$ for multiple comparisons. (Left) target noise of 0, (right) target noise of 0.01.


Figure 11: Pairwise statistical comparisons of solution rates on the ground-truth regression problems. We report Wilcoxon signed-rank tests with a Bonferroni correction on $\alpha$ for multiple comparisons. (Left) target noise of 0, (right) target noise of 0.01.