Introduction
Multiple unmanned aerial vehicles (multi-UAV) have increasingly been applied in long-term urban services, such as geohazard monitoring [1], data collection [2], and medical supply delivery [3]. To ensure sustained operations in geo-distributed environments, mission planning must prioritize both UAV deployment and task assignment, enabling effective system collaboration and mission continuity.
While integrated schemes for multi-UAV deployment and task assignment manage ground users across diverse locations sustainably [4], unforeseen circumstances often exceed predefined plans [5]. This necessitates robust replanning strategies for on-demand dispatching and real-time decision-making to respond to evolving task requests and emergencies. Addressing these challenges requires tailored frameworks and advanced methodologies to coordinate interdependent subproblems.
In this article, we propose a replanning-oriented framework for efficient real-time decision-making in dynamic multi-UAV environments. Our approach supports long-term mission execution in geo-distributed settings with dynamically emerging ground users, aligning with real-world applications, such as emergency medical services and disaster relief.
A. Related Works
1) Joint Decision-Making Framework
Recent research has focused on addressing the interdependencies between planning subproblems by employing a joint decision-making framework. By developing dual-layer architectures, the joint optimization of multi-UAV deployment and task assignment has been effectively applied in various practical scenarios, including UAV-enabled mobile edge computing [4], [6], helicopter-UAV search and rescue operations [7], and aerial base stations for emergency communication networks [8], [9]. The aforementioned work highlights the effectiveness of hierarchical planning schemes in managing scattered ground sites and enhancing long-term performance. However, these studies overlook the dynamic demands that may arise in real-world applications, which [10] identifies as a direction for future work. While dynamic task demands have been addressed through a reassignment strategy [11], real-time responsiveness may be constrained by the algorithm under investigation. Despite the improved decision-making efficiency that reinforcement learning brings to handling unexpected events [12], key subproblems have yet to be integrated to effectively tackle replanning-oriented challenges.
2) Clustering-Based Multi-UAV Deployment
The primary concern for deployment decisions is having a sufficient number of UAVs to ensure comprehensive on-demand coverage for ground users [13]. This emphasis on mission completion aligns with the simplicity of clustering algorithms, which have demonstrated effectiveness in generating feasible solutions in multi-UAV systems [8], [14], [15]. To achieve optimal deployment outcomes, various modifications have been made to the K-means clustering algorithm. These include, but are not limited to, a genetic algorithm-enhanced method with a refined Q-learning mechanism [16], the integration of iterative selection and association policies [17], and a hybrid approach based on the expectation-maximization algorithm [18]. Despite ongoing efforts to refine UAV deployment strategies, the susceptibility of K-means centroids to noise remains a challenge. Research indicates that using medoids as centroids can significantly enhance robustness against outliers, providing a superior alternative to traditional K-means in UAV deployment scenarios [19], [20]. However, the predetermined number of clusters in the K-medoids algorithm restricts its adaptability to fluctuating user demands. In addition, achieving convergence to a global optimum is challenging without the integration of external assistance strategies.
3) Learning-Based Task Assignment
Deep Q-learning (DQL) has proven effective in endowing UAV agents with real-time decision-making capabilities, especially for task assignment problems [12], [21]. The partial observability and local information in real-world multi-UAV operations necessitate a fully decentralized training paradigm, prompting the rise of multiagent reinforcement learning (MARL) frameworks [10], [22], [23]. In response, Dai et al. [24] proposed a collaborative MARL algorithm for UAV networks, leveraging deep Q-networks (DQN) to train local models. This approach significantly improves dynamic resource allocation, enhancing both the adaptability and operational efficiency of the system. To deal with large state and action spaces, Li et al. [25] incorporated an
B. Motivations
Following a comprehensive literature review, we identified a research gap in replanning-oriented real-time decision-making for multi-UAV deployment and task assignment. The specific motivations for this article are outlined as follows.
Existing research underscores the importance of hierarchical integration of deployment and assignment in addressing long-term mission execution in geo-distributed environments. However, a replanning-oriented framework is required—one that coordinates these coupled subproblems to enable not only optimal solutions but also efficient real-time decision-making in response to dynamic demands.
While modified K-means clustering algorithms effectively generate feasible deployment solutions for multi-UAV systems, unresolved challenges in existing methods impede the achievement of robust and convergent globally optimal results for long-term missions with potential changes.
Although MARL-based approaches enhance adaptability and operational efficiency in dynamic task assignment, sparse rewards in large state spaces and increased mission complexity hinder effective exploration and reduce learning efficiency in achieving a convergent optimal policy.
C. Main Contributions
We propose a dual-layer framework for real-time decision-making in multi-UAV mission replanning. Unlike existing hierarchical designs for multi-UAV deployment and task assignment, our approach simultaneously addresses predefined task coverage and responsiveness to dynamic demands. The main contributions are summarized as follows.
We propose a density-constrained K-medoids clustering and simulated annealing (SADCK-Medoid) algorithm for upper layer deployment. This novel approach integrates dynamic partitioning with iterative balanced remaining capability (BRC) minimization, ensuring near-optimal convergence, balanced redundancy for long-term adaptability, and superior clustering quality with reduced efficiency tradeoffs.
We propose a goal-oriented belief space MARL (GOBS-MARL) algorithm for real-time task assignment. The method incorporates a Bayesian updating mechanism and goal space mapping into DQL, deriving optimal policies from dynamically updating belief distributions instead of fixed intermediate goals, thereby addressing sparse reward challenges and improving training efficiency.
Extensive simulations confirm the effectiveness of the proposed framework in facilitating efficient real-time decision-making and generating robust replanning solutions. Specifically, the SADCK-Medoid algorithm outperformed its individual components (K-medoids [27] and the simulated annealing (SA) algorithm [28]) and the well-known K-means [29] in global convergence, while achieving competitive clustering quality compared to state-of-the-art (SOTA) K-means variants. The GOBS-MARL algorithm exhibited superior performance over MARL baselines, achieving the highest average reward and the lowest running time, highlighting its adaptability to dynamic environments.
D. Organization
The rest of this article is organized as follows. Section II presents preliminaries of the work. Section III describes the proposed framework and algorithms. In Section IV, we analyze the simulation results and conduct a discussion. Finally, Section V concludes this article.
Preliminaries
We consider a typical scenario, as illustrated in Fig. 1(a), characterized by distinct geographical task regions scattered across an urban area. In this scenario,
Diagram of the proposed replanning-oriented decision-making framework.
In terms of UAVs, we assume that each
Under centralized deployment, UAVs communicate via the central server, whereas distributed task assignment relies on local information exchanges within each subgroup, requiring UAVs to act as autonomous decision-makers with independent computing capabilities. The intermediary process between these two decisions is not taken into account.
A. BRC-Oriented Multi-UAV Deployment
The deployment decision divides a swarm of
By applying (1), differences among subgroups across distinct task regions can be quantified. A lower
B. Decentralized Partially Observable Markov Decision Process (Dec-POMDP)-Based Task Assignment
The task assignment is conducted within each task region. Each
State: We define a state in time slot
Action: We define the action chosen by
Transition probability: The transition probability
Observation space and function: An agent
Reward: Each
The belief space is updated using the Bayesian updating theorem, where the posterior belief at time
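The update equation itself is not recoverable in this version of the text. In standard Dec-POMDP notation (assumed here as a generic convention, not necessarily the authors' symbols), the Bayesian belief update reads

b_{t+1}(s') = \frac{O(o_{t+1} \mid s', a_t) \sum_{s \in \mathcal{S}} P(s' \mid s, a_t)\, b_t(s)}{\sum_{s'' \in \mathcal{S}} O(o_{t+1} \mid s'', a_t) \sum_{s \in \mathcal{S}} P(s'' \mid s, a_t)\, b_t(s)}

where O denotes the observation function, P the transition probability, and the denominator normalizes the posterior over states.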
C. Optimization-Based Problem Formulation
The global objective is achieved in the deployment process, in which each subgroup
The task assignment subproblem within each
Proposed Framework and Algorithms
A. Overview of the Decision-Making Framework
We implement the two-phase decision-making process through a hierarchical framework designed to optimize the coupled subproblems in a coordinated manner. As illustrated in Fig. 1(a), the upper layer deploys
To adapt to unpredictable events, coordination is facilitated through the BRC determination mechanism. As shown in Fig. 1(a), the central server collects current assignment results and updated requests, calculates the BRC value, and determines the need for replanning. This process is guided by the BRC threshold,
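As a concrete illustration, the following minimal Python sketch captures this threshold-guided coordination loop. Every name here (coordination_step, compute_brc, redeploy) is hypothetical, since the server's interface is not specified in the text.

def coordination_step(server, subgroups, threshold):
    # Hypothetical sketch of the BRC determination mechanism in Fig. 1(a):
    # the central server gathers assignment results and new requests,
    # evaluates the BRC, and triggers replanning only above the threshold.
    results = [g.report_assignment() for g in subgroups]   # current outcomes
    requests = server.collect_new_requests()               # updated demands
    brc = server.compute_brc(results, requests)            # global BRC value
    if brc > threshold:                                    # replanning trigger
        return server.redeploy(requests)                   # rerun upper layer
    return None                                            # keep current plan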
B. SADCK-Medoid Algorithm
As shown in Fig. 1(b), the K-medoids algorithm [27] classifies numerous ground users, while SA [28] allocates UAVs into task-specific subgroups. By executing these two processes in parallel, adaptability to dynamic changes is greatly enhanced. Given the set
Targets with higher values are more likely to be selected as centroids. The implementation steps of the SADCK-Medoid algorithm are detailed in Algorithm 1.
Algorithm 1: SADCK-Medoid.
for each iteration do
    Randomly initialize the first centroid;
    Initialize all clusters as empty;
    for each ground user location do
        Assign the location to its nearest cluster;
        if the cluster density constraint is violated then
            Reject and reassign to the next nearest cluster;
        else
            Accept the assignment;
        end if
    end for
    Partition UAVs arbitrarily into subgroups;
    for each annealing step do
        for each candidate move do
            Move a UAV to a different subgroup;
            if the SA acceptance criterion is satisfied then
                Accept the movement;
            else
                Reject the movement;
            end if
        end for
    end for
end for
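To make the two parallel processes concrete, the following is a minimal Python sketch of Algorithm 1 under our own simplifying assumptions: a per-cluster capacity cap stands in for the density constraint, and a variance-style surrogate stands in for the BRC objective, whose exact form is not reproduced in this version of the text.

import numpy as np

rng = np.random.default_rng(0)

def constrained_kmedoids(points, k, max_size, iters=50):
    # Density-constrained K-medoids: each user joins the nearest medoid's
    # cluster unless it is full, then spills to the next nearest cluster.
    medoids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            order = np.argsort(np.linalg.norm(medoids - p, axis=1))
            for j in order:                      # nearest feasible cluster
                if len(clusters[j]) < max_size:  # density constraint
                    clusters[j].append(p)
                    break
        for j, members in enumerate(clusters):   # medoid update step
            if members:
                m = np.array(members)
                d = np.linalg.norm(m[:, None, :] - m[None, :, :], axis=-1)
                medoids[j] = m[np.argmin(d.sum(axis=1))]
    return medoids, clusters

def sa_partition(n_uavs, demands, T0=1.0, cooling=0.95, steps=400):
    # SA over UAV-to-subgroup partitions, minimizing a surrogate BRC:
    # squared mismatch between assigned UAVs and each region's demand share.
    k = len(demands)
    assign = rng.integers(0, k, n_uavs)          # arbitrary initial partition
    def brc(a):
        counts = np.bincount(a, minlength=k)
        need = demands / demands.sum() * n_uavs
        return float(np.square(counts - need).sum())
    cost, T = brc(assign), T0
    for _ in range(steps):
        cand = assign.copy()
        cand[rng.integers(n_uavs)] = rng.integers(k)    # move one UAV
        dc = brc(cand) - cost
        if dc < 0 or rng.random() < np.exp(-dc / T):    # Metropolis test
            assign, cost = cand, cost + dc
        T *= cooling
    return assign, cost

# Example: 152 synthetic users in a 50 x 50 km area, 5 regions, 26 UAVs.
points = rng.uniform(0, 50, size=(152, 2))
medoids, clusters = constrained_kmedoids(points, k=5, max_size=40)
demands = np.array([len(c) for c in clusters], dtype=float)
assign, final_brc = sa_partition(n_uavs=26, demands=demands)

Here the capacity cap plays the role of the density constraint, and the squared mismatch serves only as a stand-in objective; the paper's actual BRC definition should be substituted where available.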
C. Goal-Oriented Belief Space MARL
The proposed GOBS-MARL algorithm tackles real-time task assignment by integrating belief-oriented hindsight experience replay (HER) [32], MADQL, and a value decomposition network (VDN). The process encompasses initialization, action execution, belief updating, experience replay and training, and optimization, as detailed in Algorithm 2.
Belief-oriented HER: Actions aligned with the belief distribution and moving in the correct direction are assigned positive rewards. At each time step, agent experiences, denoted as
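The exact relabeling rule is truncated above. As one plausible reading, a hypothetical Python sketch of belief-oriented hindsight relabeling is given below; the dictionary-style belief, the goal keys, and the unit bonus are our assumptions rather than the paper's definitions.

def belief_relabel(transitions, belief):
    # Hypothetical belief-oriented HER: transitions whose action matches
    # the currently most probable goal under the belief distribution
    # receive a positive shaped reward in hindsight.
    best_goal = max(belief, key=belief.get)   # mode of the belief distribution
    relabeled = []
    for obs, action, reward, next_obs in transitions:
        bonus = 1.0 if action == best_goal else 0.0
        relabeled.append((obs, action, reward + bonus, next_obs))
    return relabeled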
MADQL: The action-value function based on a policy
The optimal value function
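Both displayed equations are lost in this version. In standard Q-learning notation (assumed, not necessarily the authors' symbols), the policy action-value function and the Bellman optimality condition it targets are

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s,\ a_{0} = a\right], \qquad Q^{*}(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a\right].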
VDN: Assuming full cooperation among agents, all agents share the same reward function. However, joint actions
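Although the paper's exact expression is not recoverable here, the standard VDN additive decomposition, which we assume the text follows, factorizes the joint action-value into per-agent utilities conditioned on local observations:

Q_{\mathrm{tot}}(\boldsymbol{o}, \boldsymbol{a}) \approx \sum_{i=1}^{N} Q_{i}(o_{i}, a_{i}).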
Algorithm 2: GOBS-MARL.
Initialize the environment, Q-functions, target functions, and the replay buffer;
for each episode do
    for each timestep do
        Select individual actions for each agent using the current Q-function and a greedy policy as in (13);
        Execute joint actions and observe the shared reward and next observations;
        Store the transition in the replay buffer;
        if the replay buffer holds enough samples then
            Sample a minibatch from the replay buffer;
            for each sample do
                Compute the per-agent TD targets;
            end for
            Compute the total TD loss across the minibatch;
            Update network parameters;
        end if
    end for
end for
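For readers implementing the training step, the following is a minimal PyTorch sketch of the VDN-style TD loss at the core of Algorithm 2. The batch layout, the shared team reward, and all names (vdn_td_loss, q_nets, target_nets) are our assumptions, consistent with the centralized-training scheme described later but not taken from the authors' code.

import torch
import torch.nn as nn

def vdn_td_loss(q_nets, target_nets, batch, gamma=0.99):
    # One VDN update: per-agent chosen-action values are summed into a
    # joint Q-value and regressed onto a shared one-step TD target.
    obs, actions, reward, next_obs, done = batch
    q_joint = sum(net(o).gather(1, a.unsqueeze(1)).squeeze(1)
                  for net, o, a in zip(q_nets, obs, actions))
    with torch.no_grad():
        q_next = sum(t(o2).max(dim=1).values          # greedy target values
                     for t, o2 in zip(target_nets, next_obs))
        target = reward + gamma * (1.0 - done) * q_next
    return nn.functional.mse_loss(q_joint, target)

# Hypothetical usage: 2 agents, observation dim 8, 5 actions, batch of 32.
make = lambda: nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 5))
q_nets, target_nets = [make(), make()], [make(), make()]
batch = ([torch.randn(32, 8)] * 2, [torch.randint(0, 5, (32,))] * 2,
         torch.randn(32), [torch.randn(32, 8)] * 2, torch.zeros(32))
loss = vdn_td_loss(q_nets, target_nets, batch)
loss.backward()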
D. Convergence Analysis
Theorem 1:
The action-value function
Proof:
The action-value function is updated using the learning rate
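The update rule and its conditions are garbled in this version. The standard tabular Q-learning update and the Robbins–Monro learning-rate conditions under which such convergence proofs typically proceed are

Q_{t+1}(s, a) = (1 - \alpha_{t})\, Q_{t}(s, a) + \alpha_{t}\left[r_{t} + \gamma \max_{a'} Q_{t}(s', a')\right], \qquad \sum_{t=0}^{\infty} \alpha_{t} = \infty, \quad \sum_{t=0}^{\infty} \alpha_{t}^{2} < \infty.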
The state space
Since all assumptions hold,
Simulation Results and Discussion
A. Dataset and Simulation Setup
We evaluated the proposed methods through simulations using real-world datasets. The distribution of ground users was generated based on the Emergency 911 Calls dataset [34], which contains 664,000 records of emergency activities in the USA over five years. Each record includes eight key attributes detailing the locations, times, types, and requirements of specific events. We extracted a subset of 152 points within a 24-h period, representing emergency events in a 50 × 50 km region of Pennsylvania, USA.
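As an illustration of this extraction step, a short pandas sketch is given below. The file name and column labels follow the public release of the 911 calls dataset [34], and the date window is arbitrary, so all of them should be treated as assumptions.

import pandas as pd

# Load the public 911 calls dataset [34] and slice one 24-hour window;
# 'timeStamp', 'lat', and 'lng' are column names in the public release.
calls = pd.read_csv("911.csv", parse_dates=["timeStamp"])
day = calls[(calls["timeStamp"] >= "2016-01-01") &
            (calls["timeStamp"] < "2016-01-02")]
points = day[["lat", "lng"]].to_numpy()   # ground-user coordinates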
We adopted a centralized training and decentralized execution scheme. During training, agent experiences were collected in a shared replay buffer, sampled randomly, and updated uniformly using the Bayesian theorem. In the execution phase, each agent independently made decisions using its own neural network, comprising four hidden layers with 128, 128, 64, and 64 neurons, respectively. The model was trained over 500 episodes, with each episode consisting of 100 execution steps.
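The per-agent network described above can be sketched directly in PyTorch. The observation and action dimensions below are placeholders, since the paper's state and action encodings are not reproduced here.

import torch.nn as nn

class AgentQNet(nn.Module):
    # Per-agent Q-network with the stated 128-128-64-64 hidden layers;
    # obs_dim and n_actions are hypothetical placeholders.
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),   # one Q-value per discrete action
        )

    def forward(self, obs):
        return self.net(obs)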
For upper layer optimization, we evaluated convergence using the Silhouette score, Calinski–Harabasz score, and Davies–Bouldin index (Dbi). The global objective was to minimize the BRC, with an ideal threshold of
The proposed approach was implemented in Python using PyTorch on a system with an AMD Ryzen 7 5800H CPU, an NVIDIA GeForce RTX 3060 Laptop GPU, and Windows 10. Table I provides the empirical parameter settings.
B. Performance Analysis of Upper Layer Deployment
To evaluate the effectiveness of the proposed algorithm, we first analyzed its convergence properties using indices with 95% confidence intervals. Fig. 2(a) shows the Silhouette score, which measures cluster dissimilarity, converging at 0.5. Fig. 2(b) illustrates the Calinski–Harabasz score, representing the ratio of between-cluster to within-cluster scatter, peaking at approximately 250. Fig. 2(c) presents the Dbi curve, a measure of cluster compactness, stabilizing at 0.65 after around 120 iterations. These results demonstrate the algorithm's robust convergence and effectiveness in achieving well-balanced and compact clusters for multi-UAV deployment.
Convergence properties of SADCK-Medoid algorithm. (a) Silhouette score. (b) Calinski–Harabasz score. (c) Davies–Bouldin score.
Second, we analyzed deployment results in a dynamic environment with 152 ground users (circles). The 24-h period was divided into three equal stages, with the central server updating replanning decisions at the end of each stage based on new information. Fig. 3 illustrates clustering outcomes for ground users and emergency events after a two-stage updating process (inverted triangles and regular triangles, respectively). Ground users were divided into five task regions, each represented by a distinct color. Initial deployment results, shown in Fig. 4(a), allocated 4, 6, 3, 7, and 6 drones to regions with 9, 17, 9, 17, and 12 users, respectively, leaving 14 drones at the base. After the first emergency, user numbers increased in all regions except TR1, but the deployment remained unchanged. Following the second emergency, significant user increases prompted redeployment, with 6, 9, 5, 10, and 7 drones allocated per region and reinforcement drones dispatched from the base. Fig. 4(b) shows the evolution of the BRC across the three stages. In the first stage, the BRC converged to 19.11 using SADCK-Medoid. In the second stage, the BRC rose to 27.27 but remained below the threshold, avoiding redeployment. In the third stage, the BRC exceeded the threshold but reconverged to an acceptable level with algorithmic assistance. These results demonstrate the algorithm's ability to adaptively manage multi-UAV deployment, ensuring balanced dispatch and readiness for emerging demands.
Deployment results and BRC evolution curves in dynamic environments. (a) Number of deployed drones and task changes in three stages. (b) BRC evolution curves in different stages.
To further validate the proposed approach, we conducted an ablation study and compared SADCK-Medoid with constrained K-medoids [27], SA [28], and K-means [29] in minimizing the global BRC under varying numbers of ground users. As shown in Fig. 5, SADCK-Medoid consistently outperformed its individual components and K-means, achieving impressive results with 150 ground users by minimizing the BRC value within 250 iterations [see Fig. 5(a)]. Although additional iterations were required for larger numbers of users [see Fig. 5(b) and (c)], it maintained competitive global convergence performance.
Comparative curves of BRC evolution across varying numbers of ground users. (a) Total number of 150 ground users. (b) Total number of 200 ground users. (c) Total number of 250 ground users.
The comparison, summarized in Table II, highlights the proposed algorithm's balance between quality and efficiency. With a time complexity of
C. Performance Analysis of Lower Layer Assignment
Task assignment was conducted for each subgroup within their respective task regions based on the upper layer deployment outcomes, utilizing the proposed GOBS-MARL algorithm. Fig. 6 presents the task assignment results across five subfigures, each representing a specific task region. The geographical distributions align with the deployment results, ensuring comprehensive ground user coverage. Initially, drones were strategically positioned near existing targets to maximize the likelihood of detecting unpredictable events. As new ground users emerged, drones executed tasks guided by the global BRC objective and their capacity (indicated by light-colored circular coverage areas). Redeployment was activated when necessary, employing additional drones to ensure updated ground users were adequately covered (dark circular coverage areas). Evidently, the well-trained GOBS-MARL effectively ensures comprehensive task coverage while adapting to dynamic ground user updates.
Coverage of ground users in each task region based on GOBS-MARL algorithm. (a) TR1. (b) TR2. (c) TR3. (d) TR4. (e) TR5.
To evaluate the effectiveness of the assignment policies learned by GOBS-MARL, we compared its training rewards with those of MADQN [38] across each task region. Using the third stage as an example, Fig. 7 illustrates the learning curve comparisons for each task region. Fig. 7(b) highlights GOBS-MARL's significant early-phase efficiency improvement. In Fig. 7(a), (d), and (e), both algorithms show sharp initial increases in average rewards, followed by steady increments and smooth progress after approximately 100 episodes. Notably, GOBS-MARL consistently achieves the highest average reward in these regions. Although the performance appears similar in Fig. 7(c), the cumulative rewards across all task regions, as shown in Fig. 8, clearly demonstrate the superior efficiency and training performance of the proposed GOBS-MARL algorithm. As shown, it exhibits an earlier upward trend, achieving higher rewards after approximately 20 episodes and maintaining stability thereafter. This result highlights the effectiveness of the goal-oriented belief space updating mechanism in improving learning efficiency.
Comparison of the learning curves of GOBS-MARL and MADQN for each task region. (a) TR1. (b) TR2. (c) TR3. (d) TR4. (e) TR5.
Table III summarizes the statistical results of the evaluated algorithms. GOBS-MARL achieves the highest average reward of 12,290, surpassing all compared methods. It demonstrates efficiency with a shorter inference time of 0.286 s and fewer FLOPs (28) compared with MAA2C and VDPPO. While MADQN shows lower computational complexity, it suffers from significantly higher training (3823) and testing (186) times. In contrast, GOBS-MARL reduces these times to 3779 and 184, respectively, highlighting its suitability for real-time decision-making. In addition, comparisons among GOBS-MARL, GOBS-A, and GOBS-B reveal that the belief-based trend learning mechanism outperforms fixed intermediate goals and a penalty-based reward scheme. In general, GOBS-MARL demonstrates notable advantages in both effectiveness and efficiency, with only a minimal increase in inference time. This performance is largely attributed to the Bayesian updating mechanism and goal space mapping, which dynamically refine beliefs and optimize goals, mitigating sparse rewards and enhancing training efficiency.
D. Discussion
Advantages: A key strength of the framework is its integration of offline deployment pregeneration and real-time assignment decisions through online training. This approach not only supports long-term services for predefined tasks but also enables UAVs to efficiently adapt to unforeseen circumstances with dynamic demands. The coordination mechanism further enhances operational flexibility, facilitating seamless transitions between tasks.
Limitations: The determination of the BRC threshold relies on extensive simulations and requires incorporating additional UAVs and task-specific details, which can be time-consuming. In addition, the proposed method does not fully account for UAV failures, environmental uncertainties, or communication constraints that may arise during task execution.
Practical applications and real-world deployment: A typical application of the proposed framework is supply delivery. For validation, real-world experiments can be conducted in an indoor scenario, where UAVs are required to visit scattered ground users. As illustrated in Fig. 9, a ground station acts as a central server for offline computation, model training, and data processing. It transmits control commands and global planning decisions via the robot operating system (ROS). Each quadrotor, serving as the experimental platform, is equipped with a Pixhawk4 flight controller, onboard computing units (e.g., Raspberry Pi 4 with 8 GB RAM), visual sensors (e.g., RealSense Depth Camera D435i for environmental perception), and communication modules (e.g., wireless local area networks). Ground users are represented by mobile robots, offering mobility and flexibility for dynamic deployment. Each robot is assigned a unique task demand value, denoted by numerical labels detectable by cameras. Based on the described environment and hardware setup, real-world deployment can be summarized as follows.
Step 1: Collect and process environmental data, perform offline computation, and pretrain the GOBS-MARL model.
Step 2: Generate deployment decisions, transmit the results, and deploy the trained network to each quadrotor.
Step 3: Gather environmental data, update the local belief space, and generate real-time task assignment decisions.
Step 4: Modify the distributions and quantities of mobile robots, detect dynamic requests via onboard cameras, relay the information to the ground station, and determine whether to trigger replanning decisions.
Step 5: Conduct performance evaluation and implement experimental improvements.
Environmental data are obtained using a LiDAR sensor and processed within the ROS architecture. Performance metrics include, but are not limited to, computational efficiency (e.g., average decision-making time per quadrotor), power consumption (e.g., battery life under varying workloads), and adaptability to dynamic environments (e.g., response time to dynamic requests). These metrics provide valuable insights into the algorithm's real-world applicability and robustness.
Conclusion
The proposed framework integrates and coordinates deployment and assignment subproblems for multi-UAV mission replanning. Unlike existing approaches, our framework applies high-level management principles to long-term missions in geo-distributed environments, combining offline computation for global objectives with online training for real-time decision-making. This design effectively addresses current operational demands while mitigating unforeseen challenges. The SADCK-Medoid algorithm demonstrates superior convergence and global search capabilities, while the GOBS-MARL algorithm enhances learning efficiency for real-time task assignment. Experimental results validate the framework's effectiveness, showing significant performance improvements over SOTA methods in dynamic and complex environments.
Future research will focus on conducting theoretical analyses to refine parameter selection for the algorithms. Further testing on a broader range of datasets is also necessary. In practical applications, future work could investigate UAV dynamics.