Differential evolution (DE) is an efficient and effective global optimization algorithm, and many DE variants with adaptive strategy selection have been proposed. However, among the existing DE variants with adaptive strategy selection, only the population information from the last few generations is used for the statistical analysis that drives strategy selection, and the cumulative historical experience of the population is not fully used to help the population search effectively. In this paper, a DE with autonomous strategy selection (ASS-DE) is proposed to make full use of the population's cumulative historical experience. In ASS-DE, an individual can autonomously choose a better mutation strategy guided by the cumulative historical experience. At the same time, a parameter updating mechanism with an archive is introduced to assign appropriate control parameters to each strategy. Additionally, an evolutionary learning mechanism based on individual similarity is proposed to guide the learning of historical experience information. To verify the performance of ASS-DE, it is compared with some state-of-the-art intelligent algorithms on the CEC2015 and CEC2017 benchmark functions and on a real-world problem of wavelet parameter optimization for remote sensing image denoising. The experimental results show that ASS-DE achieves promising performance compared with other DE variants. The source code of ASS-DE can be found on GitHub at https://github.com/Ramessis/ASS-DE_denoising.git.
1. Introduction
Global optimization problems (GOP) (Bangyal, Nisar, Ag. Ibrahim, Haque, Rodrigues, & Rawat, 2021) exist in various fields and are often complex, nonlinear, hybrid, multi-modal, and large-scale (Chen, Li, & Yang, 2020). Traditional mathematical methods are often inadequate for solving these complex problems. However, meta-heuristic algorithms have proven effective in solving non-deterministic polynomial (NP)-hard problems. These algorithms simulate specific mechanisms inspired by nature, such as biological swarms, natural phenomena, and social behavior, and can quickly and accurately find solutions to complex problems without the need for gradient descent (Beheshti & Shamsuddin, 2013). In past research, many meta-heuristic algorithms have been proposed, such as the genetic algorithm (GA) (Chen, Li, Tang & Liu, 2008), differential evolution (DE) (Storn & Price, 1997), ant colony optimization (ACO) (Guan, Zhao & Li, 2021), particle swarm optimization (PSO) (Wang, Hossein Gandomi, Yang & Hossein Alavi, 2014), and phase-based optimization (PBO) (Cao & Wang, 2017).
Among these meta-heuristic algorithms, DE is highly recognized by most researchers owing to its robustness, simplicity, and the strong convergence ability provided by its mutation, crossover, and selection operators. It has been successfully applied to various fields such as medical problem optimization (Tsompanas, Bull, Adamatzky & Balaz, 2021), engineering design (Yadav, Yadav, Kaur & Singh, 2021), path planning (Rodríguez-Molina, Solís-Romero, Villarreal-Cervantes, Serrano-Pérez & Flores-Caballero, 2021), and computer vision (Kehar & Chopra, 2021). Different mutation strategies and control parameters of DE are suitable for different problems, and the selection of mutation strategies and parameters largely affects the performance of DE. Given the scope of this paper, we mainly focus on the impact of the mutation strategies of DE. Some strategies better improve the global convergence ability, while other strategies pay more attention to the exploration ability of the algorithm (Biswas, Kundu & Das, 2014; Gong, Zhou & Cai, 2015; Deng, Xu, Song & Zhao, 2021; Liu & Lampinen, 2005; Shao, Das, Konar & Chakraborty, 2005; Zhou, Wang, Zhou, Qiu, Bi & Cai, 2016). In the past few decades, many DE variants with adaptive strategy selection methods have been proposed to solve different types of complex problems (Mallipeddi, Suganthan, Pan & Tasgetiren, 2011; Tanabe & Fukunaga, 2013; Wang, Cai & Zhang, 2011; Qin, Huang & Suganthan, 2008). For example, in the literature (Hassan, Hemeida, Alkhalaf, Mohamed & Senjyu, 2020), the authors divided the whole evolution process into five stages in temporal order based on their own experience, and different mutation strategies are randomly selected in each stage to balance the exploration and exploitation performance of the algorithm. In the literature (Hu, Gong & Li, 2021), a dynamic adjustment factor $w$ was used to control the proportion of the two mutation strategies during evolution, and an adaptive update method for $w$ was designed to further improve the ability of the algorithm to escape local optima.
Although the above adaptive strategy selection methods have effectively improved the performance of DE, designers set the adaptive rules mainly based on their own cognition and experience of the problems being solved, which is not suitable for black-box optimization problems with unknown characteristics. In addition, those adaptive methods lack a learning mechanism for generating effective historical experience of strategy selection. For example, although the methods in (Zhang & Sanderson, 2009) and (Mallipeddi, Suganthan, Pan & Tasgetiren, 2011) retain effective information generated during the evolutionary history, their mutation strategy decisions are guided only by historical information from the most recent generations, resulting in poor robustness as the complexity of the optimization problem increases.
Recently, reinforcement learning (RL) has shown beneficial performance in solving sequential decision-making problems (Rogova, Scott & Lolett, 2002) and has achieved good results in many practical application fields, such as intelligent decision-making (Yang, Xu & Wang, 2021), unmanned driving (Liu, Luosang, Yao & Su, 2021), intelligent control (Hasanzade & Koyuncu, 2021), and game development (Mnih, Kavukcuoglu, Silver, Graves, Antonoglou, Wierstra & Riedmiller, 2013). Many scholars have further improved the performance of DE by using the autonomous decision-making search ability of RL (Cui, Hu & Rahmani, 2022; Hu, Gong & Li, 2021; Huynh, Do & Lee, 2021; Karafotias, Eiben & Hoogendoorn, 2014; Kim & Lee, 2009; Tan & Li, 2021; Wang, Cao, Du, Jia, Han, Tian & Liu, 2022; Wang, Zhao & Tang, 2022). Regarding autonomous decision-making for mutation strategies, Zhao, Hu, Wang, Zhao & Tang (2022) proposed a reinforcement learning brain storm optimization algorithm (RLBSO) to address the blindness of evolutionary search by accurately choosing the mutation strategy according to the feedback of each generation's successful experience. Tan and Li (2021) designed a DE with mixed mutation strategies based on DQN (DE-DQN), which analyzes the fitness landscape of the optimization problem to predict the appropriate mutation strategy in each generation. Inspired by the autonomous decision-making mechanism of RL, Wang, Cao, Du, Jia, Han, Tian, and Liu (2022) proposed a differential evolution algorithm with autonomous selection of the mutation strategy and control parameters (ASDE). In ASDE, historical experience files with population characteristics are used to preserve the accumulated historical experience of mutation strategies and control parameter combinations.
Although the above RL-based adaptive strategy selection methods have effectively improved individuals' autonomous decision-making ability, they neglect the rational setting of rewards and do not make effective use of strategy selection experience. As an effective method of autonomous decision-making, RL can fully exploit the autonomous learning ability of the agent. By selecting the optimal action through interaction with the environment, the robustness of the algorithm can be greatly improved. Compared with the evolution information of only the most recent generations, the effective use of the whole cumulative historical experience of parameter control and strategy selection can provide more thorough guidance for the population search.
Based on the above considerations, this paper introduces the autonomous decision-making ability of RL into DE and proposes a differential evolution with autonomous strategy selection (ASS-DE) that uses RL to learn individuals' whole historical experience. ASS-DE regards each individual in the population as an agent, and the cumulative historical information obtained by the individual during evolution is used to realize autonomous decision-making for different individuals at different stages of evolution. ASS-DE also uses the fitness improvement rate of individuals and the success probability of the executed mutation strategy as historical experience information to guide the evolution of the population. The main contributions of this paper are as follows.
(1) An autonomous strategy selection method for DE is proposed, which regards an individual in the population as an agent and records the decision information generated by the individual over the evolutionary history. By using RL to learn from cumulative historical information, the potential value of population experience in strategy selection makes individuals' autonomous decisions more accurate and flexible.
(2) A parameter updating mechanism with an archive is proposed. The successful parameter information corresponding to different mutation strategies is stored in the archive, and new parameters are generated from the archive information. In this way, the search behavior of each mutation strategy is regulated more reasonably and autonomous decision-making is realized more efficiently.
(3) An evolutionary learning mechanism based on individual similarity is proposed to guide the learning of similar individuals by judging the linear correlation between evolving individuals. This mechanism guides individuals to learn more accurately while improving learning efficiency.
The rest of this paper is structured as follows. Section 2 briefly describes the basic DE and common mutation strategies, and introduces the related knowledge of RL. Section 3 describes the proposed ASS-DE algorithm. Section 4 reports the experimental study on CEC2015 and CEC2017 and the results of wavelet parameter optimization in remote sensing image denoising, followed by the conclusions in Section 5.
2. Basic knowledge
2.1. Differential evolution
This section provides a brief introduction to the basic DE algorithm, which comprises initialization, mutation, crossover, and selection. The evolutionary search process of DE can be described as follows.
2.1.1. Initialization
In the initialization process, the population $P$ is composed of $NP$ solution vectors generated by random sampling in the solution space. The population of DE consists of $NP$ $D$-dimensional parameter vectors $\mathrm{Pop}^{g}=\left\{\overrightarrow{X_{i}^{g}}=\left(x_{i,1}^{g}, x_{i,2}^{g}, \cdots, x_{i,D}^{g}\right) \mid i=1,2,\cdots,NP\right\}$, where $NP$ represents the population size, $D$ denotes the dimension of the objective problem, and $\overrightarrow{X_{i}^{g}}$ represents the $i$-th individual of the population in generation $g$. An individual generated by DE can be expressed as follows.
$$\overrightarrow{X_{i}^{g}} = lb + (ub - lb) \cdot \mathrm{rand}(0,1)$$
The initial population is obtained by uniformly sampling the feasible solution space, where $ub$ and $lb$ denote the upper and lower bounds of the search space, respectively.
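As a concrete illustration, the following Python sketch performs this uniform initialization; the function name and the array layout are illustrative assumptions rather than the authors' released implementation.

```python
import numpy as np

def initialize_population(NP, D, lb, ub, rng=None):
    """Uniformly sample NP individuals of dimension D within [lb, ub]."""
    rng = np.random.default_rng() if rng is None else rng
    # Each row is one individual: X_i^0 = lb + (ub - lb) * rand(0, 1)
    return lb + (ub - lb) * rng.random((NP, D))

# Example: 50 individuals in a 10-dimensional search space bounded by [-100, 100]
pop = initialize_population(NP=50, D=10, lb=-100.0, ub=100.0)
```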
2.1.2. Mutation
Fig. 1. Markov decision process.

Fig. 2. Updating method of Q-learning.

After initialization, a new mutation vector $\overrightarrow{V_{i}^{g}}=\left\{v_{i,1}^{g}, v_{i,2}^{g}, \cdots, v_{i,D}^{g}\right\}$ is generated from the population using the mutation operator. The mutation strategy of DE can be described as “DE/$x$/$y$/$z$”, where $x$ represents the vector information used to guide the population mutation; for example, “rand” denotes a random vector, “best” denotes the optimal vector in the population, and “current” denotes the current vector itself. $y$ represents the number of difference vectors involved in the mutation, and $z$ represents the crossover method adopted by DE, where “bin” denotes binomial crossover and “exp” denotes exponential crossover. Basic DE and other popular mutation strategies are listed as follows.
“DE/rand/1” [7]:
$$\overrightarrow{V_{i}^{g}}=\overrightarrow{X_{r1}^{g}}+F \cdot\left(\overrightarrow{X_{r2}^{g}}-\overrightarrow{X_{r3}^{g}}\right)$$

“DE/rand/2”:
$$\overrightarrow{V_{i}^{g}}=\overrightarrow{X_{r1}^{g}}+F \cdot\left(\overrightarrow{X_{r2}^{g}}-\overrightarrow{X_{r3}^{g}}\right)+F \cdot\left(\overrightarrow{X_{r4}^{g}}-\overrightarrow{X_{r5}^{g}}\right)$$

“DE/best/1”:
$$\overrightarrow{V_{i}^{g}}=\overrightarrow{X_{\text{best}}^{g}}+F \cdot\left(\overrightarrow{X_{r1}^{g}}-\overrightarrow{X_{r2}^{g}}\right)$$

“DE/best/2”:
$$\overrightarrow{V_{i}^{g}}=\overrightarrow{X_{\text{best}}^{g}}+F \cdot\left(\overrightarrow{X_{r1}^{g}}-\overrightarrow{X_{r2}^{g}}\right)+F \cdot\left(\overrightarrow{X_{r3}^{g}}-\overrightarrow{X_{r4}^{g}}\right)$$

“DE/current-to-rand/1”:
$$\overrightarrow{V_{i}^{g}}=\overrightarrow{X_{i}^{g}}+K \cdot\left(\overrightarrow{X_{r1}^{g}}-\overrightarrow{X_{i}^{g}}\right)+F \cdot\left(\overrightarrow{X_{r2}^{g}}-\overrightarrow{X_{r3}^{g}}\right)$$

“DE/current-to-best/1”:
$$\overrightarrow{V_{i}^{g}}=\overrightarrow{X_{i}^{g}}+F \cdot\left(\overrightarrow{X_{\text{best}}^{g}}-\overrightarrow{X_{i}^{g}}\right)+F \cdot\left(\overrightarrow{X_{r1}^{g}}-\overrightarrow{X_{r2}^{g}}\right)$$

“DE/current-to-pbest/1”:
$$\overrightarrow{V_{i}^{g}}=\overrightarrow{X_{i}^{g}}+F \cdot\left(\overrightarrow{X_{\text{pbest}}^{g}}-\overrightarrow{X_{i}^{g}}\right)+F \cdot\left(\overrightarrow{X_{r1}^{g}}-\overrightarrow{X_{r'}^{g}}\right)$$
In the mutation operators above, $r1, r2, r3, r4$, and $r5$ are index values randomly sampled from the $NP$ individuals, and they must be mutually distinct ($r1 \neq r2 \neq r3 \neq r4 \neq r5$). $\overrightarrow{X_{i}^{g}}$ represents the current individual, and $\overrightarrow{X_{\text{best}}^{g}}$ denotes the best individual of the population in generation $g$. $\overrightarrow{X_{\text{pbest}}^{g}}$ is randomly chosen from the top $100p\%$ individuals of the current population, $p \in (0,1]$. $F$ is the scaling factor; $K$ and $F$ are usually selected randomly within the range $[0,1]$.
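To make the notation concrete, the sketch below implements two of the listed strategies in Python with NumPy; the function names and the assumption that `pop` is an `(NP, D)` array are illustrative, not taken from the ASS-DE source code.

```python
import numpy as np

def mutate_rand_1(pop, i, F, rng):
    """DE/rand/1: V_i = X_r1 + F * (X_r2 - X_r3), with r1, r2, r3 and i all distinct."""
    candidates = [j for j in range(len(pop)) if j != i]
    r1, r2, r3 = rng.choice(candidates, size=3, replace=False)
    return pop[r1] + F * (pop[r2] - pop[r3])

def mutate_current_to_best_1(pop, fitness, i, F, rng):
    """DE/current-to-best/1: V_i = X_i + F * (X_best - X_i) + F * (X_r1 - X_r2)."""
    best = int(np.argmin(fitness))                      # minimization is assumed
    candidates = [j for j in range(len(pop)) if j != i]
    r1, r2 = rng.choice(candidates, size=2, replace=False)
    return pop[i] + F * (pop[best] - pop[i]) + F * (pop[r1] - pop[r2])
```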
2.1.3. Crossover
After the mutation, the mutation vector $\overrightarrow{V_{i}^{g}}$ is crossed with its parent to generate a new trial vector $\overrightarrow{U_{i}^{g}}=\left\{u_{i,1}^{g}, u_{i,2}^{g}, \cdots, u_{i,D}^{g}\right\}$. Binomial crossover is the most commonly used crossover operation in DE, which can be expressed as follows.
$$u_{i,j}^{g}=\begin{cases} v_{i,j}^{g}, & \left(\mathrm{rand}_{j} \leq CR\right) \ \text{or} \ \left(j=j_{rand}\right) \\ x_{i,j}^{g}, & \text{otherwise} \end{cases}$$
where $CR$ is the crossover rate ($CR \in [0,1]$), and $j_{rand}$ is a randomly selected integer in $[1, D]$ that guarantees at least one component of the trial vector is inherited from the mutation vector.
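A minimal implementation of this binomial crossover (with the same variable names used as an illustrative assumption) could look like this:

```python
import numpy as np

def binomial_crossover(x, v, CR, rng):
    """Binomial crossover: take the mutant component where rand_j <= CR,
    and force at least one mutant component through the j_rand index."""
    D = len(x)
    mask = rng.random(D) <= CR
    mask[rng.integers(D)] = True        # j_rand guarantees U differs from X
    return np.where(mask, v, x)
```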
2.1.4. Selection
The fitness value of the trial vector $\overrightarrow{U_{i}^{g}}$ obtained after crossover is calculated and compared with the fitness value of the original vector $\overrightarrow{X_{i}^{g}}$, and the better individual is retained in the next generation's population. Minimization problems are considered in this paper, so the selection operator can be expressed as follows.
$$\overrightarrow{X_{i}^{g+1}}=\begin{cases} \overrightarrow{X_{i}^{g}}, & \text{if } f\left(\overrightarrow{U_{i}^{g}}\right)>f\left(\overrightarrow{X_{i}^{g}}\right) \\ \overrightarrow{U_{i}^{g}}, & \text{otherwise} \end{cases}$$
where the individual fitness value is represented by $f(\cdot)$. In the process of evolution, the three stages of mutation, crossover, and selection are repeated until the convergence condition is reached.
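Putting the operators together, a single DE generation built from the helpers sketched above (a simplified loop assuming a user-supplied objective function `f` and minimization) might read:

```python
def evolve_one_generation(pop, fitness, f, F, CR, rng):
    """One DE generation: DE/rand/1 mutation, binomial crossover, greedy selection."""
    for i in range(len(pop)):
        v = mutate_rand_1(pop, i, F, rng)             # mutation
        u = binomial_crossover(pop[i], v, CR, rng)    # crossover
        fu = f(u)
        if fu <= fitness[i]:                          # greedy selection (minimization)
            pop[i], fitness[i] = u, fu
    return pop, fitness
```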
2.2. Reinforcement learning
2.2.1. Markov decision process
A basic Markov decision process (MDP) model (Watkins & Dayan, 1992) is composed of a five-tuple $\langle S, A, P, R, \gamma\rangle$. The state space $S$ consists of a finite number of states, and the action space $A$ consists of a finite number of actions. $P$ is the transition probability between states, composed of a state-action transition probability matrix. $R$ is the expected reward generated by the agent's state transition. $\gamma$ is the discount factor, which is mainly used to describe the influence of past states on the current state during the agent's state transfer. As shown in Fig. 1, the state transition in an MDP is related only to the current state and the action taken in the current state, without considering previous state transitions.
2.2.2. Q-learning
As the most classic algorithm in RL, Q-learning is a model-free method based on temporal difference (TD) learning (Das, Mullick & Suganthan, 2016). Q-learning can effectively make autonomous decisions based on the cumulative reward and real-time state feedback, without requiring a priori information. Q-learning makes decisions by calculating the cumulative reward obtained by the agent, so it can make better use of the agent's historical information. The updating method of Q-learning is depicted in Fig. 2. The state-action value function update of Q-learning can be expressed as follows.
$$Q(s, a) \leftarrow Q(s, a)+\alpha\left[r+\gamma \max_{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)-Q(s, a)\right]$$
where $s$ is the current state, $a$ is the action taken in $s$, $\alpha$ is the learning rate, and $r$ is the immediate reward obtained by executing action $a$ in state $s$. $\gamma$ is the discount factor, which regulates the influence of past cumulative rewards on the current state as the step length increases. $s^{\prime}$ represents the subsequent state reached after taking action $a$ in $s$, and $a^{\prime}$ is the action selected in $s^{\prime}$.
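For concreteness, a minimal tabular version of this update rule might be implemented as follows; encoding states and actions as integer indices into a Q-table is an illustrative assumption here, not the design used in ASS-DE.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Example: 4 states, 3 actions; one observed transition (s=0, a=2, r=1.0, s'=1)
Q = np.zeros((4, 3))
Q = q_update(Q, s=0, a=2, r=1.0, s_next=1)
```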
3. DE with autonomous strategy selection
It is well known that the performance of DE is affected by mutation strategies (Hassan et al., 2020; Suganthan et al., 2005). Over the past