
Replanning-Oriented Framework for Efficient Real-Time Decision-Making in Multi-UAV Systems

Xingshuo Hai; Longyan Tan; Qiang Feng; Haibin Duan; Changyun Wen

Abstract:

Efficient real-time decision-making for long-term multiple unmanned aerial vehicles (multi-UAV) missions in geo-distributed environments requires an integrated approach to manage dynamic task demands. We propose a hierarchical dual-layer decision-making framework for multi-UAV mission replanning. The upper layer optimizes multi-UAV deployment using the density-constrained K-medoids clustering and simulated annealing algorithm, achieving globally optimal solutions. The lower layer addresses task assignment via the goal-oriented belief space multiagent reinforcement learning algorithm, which leverages updated belief distributions to mitigate sparse reward and enhance training efficiency. Coordination between the two layers ensures comprehensive coverage of predefined demands while adapting to dynamic events. The effectiveness of the proposed methods is validated through a real-world case study using the 911 call dataset from city emergency services.
Published in: IEEE Transactions on Industrial Informatics (Volume: 21, Issue: 7, July 2025)
Page(s): 5127 - 5137
Date of Publication: 08 April 2025



SECTION I.

Introduction

Multiple unmanned aerial vehicles (multi-UAV) have increasingly been applied in long-term urban services, such as geohazards monitoring [1], data collection [2], and medical supply delivery [3]. To ensure sustained operations in geo-distributed environments, mission planning must prioritize both UAV deployment and task assignment, enabling effective system collaboration and mission continuity.

While integrated schemes for multi-UAV deployment and task assignment manage ground users across diverse locations sustainably [4], unforeseen circumstances often exceed predefined plans [5]. This necessitates robust replanning strategies for on-demand dispatching and real-time decision-making to respond to evolving task requests and emergencies. Addressing these challenges requires tailored frameworks and advanced methodologies to coordinate interdependent subproblems.

In this article, we propose a replanning-oriented framework for efficient real-time decision-making in dynamic multi-UAV environments. Our approach supports long-term mission execution in geo-distributed settings with dynamically emerging ground users, aligning with real-world applications, such as emergency medical services and disaster relief.

A. Related Works

1) Joint Decision-Making Framework

Recent research has focused on addressing the interdependencies between planning subproblems by employing a joint decision-making framework. By developing dual-layer architectures, the joint optimization of multi-UAV deployment and task assignment has been effectively applied in various practical scenarios, including UAV-enabled mobile edge computing [4], [6], helicopter-UAV search and rescue operations [7], and aerial base stations for emergency communication networks [8], [9]. The aforementioned work highlights the effectiveness of hierarchical planning schemes in managing scattered ground sites and enhancing long-term performance. However, they overlook the dynamic demands that may arise in real-world applications, which has been identified in [10] as a future direction. While dynamic task demands have been addressed through a reassignment strategy [11], real-time responsiveness may be constrained by the algorithm under investigation. Despite the improved decision-making efficiency in handling unexpected events brought by reinforcement learning [12], there remains a lack of integration of key subproblems to effectively tackle replanning-oriented challenges.

2) Clustering-Based Multi-UAV Deployment

The primary concern for deployment decisions is having a sufficient number of UAVs to ensure comprehensive on-demand coverage for ground users [13]. This emphasis on mission completion aligns with the simplicity of clustering algorithms, which have demonstrated effectiveness in generating feasible solutions for multi-UAV systems [8], [14], [15]. To achieve optimal deployment outcomes, various modifications have been implemented to the K-means clustering algorithm. These include, but are not limited to, a genetic algorithm-enhanced method with a refined Q-learning mechanism [16], the integration of iterative selection and association policies [17], and a hybrid approach based on the expectation-maximization algorithm [18]. Despite ongoing efforts to refine UAV deployment strategies, the susceptibility of K-means centroids to noise remains a challenge. Research indicates that the utilization of medoids as centroids can significantly enhance robustness against outliers, providing a superior alternative to traditional K-means in UAV deployment scenarios [19], [20]. However, the predetermined number of clusters in the K-medoids algorithm restricts its adaptability to fluctuating user demands. In addition, achieving convergence to a global optimum is challenging without the integration of external assistance strategies.

3) Learning-Based Task Assignment

Deep Q-learning (DQL) has proven effective in endowing UAV agents with real-time decision-making capabilities, especially for task assignment problems [12], [21]. The partial observability and local information in real-world multi-UAV operations necessitate a fully decentralized training paradigm, prompting the rise of multiagent reinforcement learning (MARL) frameworks [10], [22], [23]. In response, Dai et al. [24] proposed a collaborative MARL algorithm for UAV networks, leveraging deep Q-networks (DQN) to train local models. This approach significantly improves dynamic resource allocation, enhancing both the adaptability and operational efficiency of the system. To deal with the large state and action spaces, Li et al. [25] incorporated an ε-greedy policy into an onboard DQN-based resource allocation scheme, balancing exploration and exploitation, and used experience replay to efficiently manage space expansion. Similar mechanisms were employed to optimize ground sensor selection, where experience replay was used to enhance learning efficiency by breaking the correlation between consecutive samples [26]. Despite these modifications, the results show that sparse rewards in the expansive state space hinder the agent's ability to effectively explore the environment through transition probabilities. In addition, dynamic task demands complicate the training process, reducing learning efficiency and impeding convergence to the optimal policy.

B. Motivations

Following a comprehensive literature review, we identified a research gap in replanning-oriented real-time decision-making for multi-UAV deployment and task assignment. The specific motivations for this article are outlined as follows.

  1. Existing research underscores the importance of hierarchical integration of deployment and assignment in addressing long-term mission execution in geo-distributed environments. However, a replanning-oriented framework is required—one that coordinates these coupled subproblems to enable not only optimal solutions but also efficient real-time decision-making in response to dynamic demands.

  2. While modified K-means clustering algorithms effectively generate feasible deployment solutions for multi-UAV systems, unresolved challenges in existing methods impede the achievement of robust and convergent globally optimal results for long-term missions with potential changes.

  3. Although MARL-based approaches enhance adaptability and operational efficiency in dynamic task assignment, sparse rewards in large state spaces and increased mission complexity hinder effective exploration and reduce learning efficiency in achieving a convergent optimal policy.

C. Main Contributions

We propose a dual-layer framework for real-time decision-making in multi-UAV mission replanning. Unlike existing hierarchical designs for multi-UAV deployment and task assignment, our approach simultaneously addresses predefined task coverage and responsiveness to dynamic demands. The main contributions are summarized as follows.

  1. We propose a density-constrained K-medoids clustering and simulated annealing (SADCK-Medoid) algorithm for upper layer deployment. This novel approach integrates dynamic partitioning with iterative balanced remaining capability (BRC) minimization, ensuring near-optimal convergence, balanced redundancy for long-term adaptability, and superior clustering quality with reduced efficiency tradeoffs.

  2. We propose a goal-oriented belief space MARL (GOBS-MARL) algorithm for real-time task assignment. The method incorporates a Bayesian updating mechanism and goal space mapping into DQL, deriving optimal policies from dynamically updating belief distributions instead of fixed intermediate goals, thereby addressing sparse reward challenges and improving training efficiency.

  3. Extensive simulations confirm the effectiveness of the proposed framework in facilitating efficient real-time decision-making and generating robust replanning solutions. Specifically, the SADCK-Medoid algorithm outperformed its individual components (K-medoids [27] and simulated annealing (SA) algorithm [28]) and the well-known K-means [29] in global convergence, while achieving competitive clustering quality compared to state-of-the-art (SOTA) K-means variants. The GOBS-MARL algorithm exhibited superior performance over MARL baselines, achieving the highest average reward and the lowest running time, highlighting its adaptability to dynamic environments.

D. Organization

The rest of this article is organized as follows. Section II presents preliminaries of the work. Section III describes the proposed framework and algorithms. In Section IV, we analyze the simulation results and conduct a discussion. Finally, Section V concludes this article.

SECTION II.

Preliminaries

We consider a typical scenario, as illustrated in Fig. 1(a), characterized by distinct geographical task regions scattered across an urban area. In this scenario, $N$ UAVs, denoted by the set $V=\{v_1,v_2,\ldots,v_N\}$, provide parcel delivery services and emergency medical supply allocation [15] for $K$ different communities, denoted by the set $A=\{A_1,A_2,\ldots,A_K\}$. The mission involves visiting a finite set of $M$ static ground users, $\Gamma=\{\tau_1,\tau_2,\ldots,\tau_M\}$, scattered across different task regions, and responding in real time to $M$ unpredictable events, analogous to dynamic requests at time step $t$. The set of additional users is denoted by $\Gamma(t)=\{\tau_1(t),\tau_2(t),\ldots,\tau_M(t)\}$. In real-world scenarios, these unscripted events progressively manifest over time intervals.

Fig. 1. Diagram of the proposed replanning-oriented decision-making framework.

In terms of UAV, we assume that each $v_i, i=1,2,\ldots,N$, in $V$ can be modeled as a five-tuple $\langle l_i,s_i,\theta_i,c_i,f_i\rangle$, where $l_i=(x_i,y_i,z_i)$, $s_i>0$, $\theta_i\in(0,\pi)$, $c_i\in\mathbb{N}^+$, and $f_i>0$ represent the location coordinates, constant airspeed, observation (azimuth and elevation) angle, capacity level, and maximum flight mileage, respectively. As depicted in Fig. 1(a), $\theta_i$, as an angular value, represents the field of view of the camera mounted on $v_i$, with a circular area of radius $z_i\tan(\theta_i/2)$ on the ground. The variable $c_i$ denotes an operational ability linked to task demands, specifying the maximum limit of onboard parcels for $v_i$. We define task regions such that each $A_k, k=1,2,\ldots,K$ is restricted to a bounded area. Each $\tau_j, j=1,2,\ldots,M$ is represented by a predetermined location $l_j=(x_j,y_j,0)$ on the ground and a specific task demand value $d_j\in\mathbb{N}^+$, indicating the number of required parcels.
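For concreteness, the scenario primitives above can be captured with simple data containers. The following Python sketch is illustrative only; the class and field names are ours, not the paper's:

```python
import math
from dataclasses import dataclass
from typing import Tuple

@dataclass
class UAV:
    # Five-tuple <l_i, s_i, theta_i, c_i, f_i> from Section II
    location: Tuple[float, float, float]  # l_i = (x_i, y_i, z_i)
    airspeed: float                       # s_i > 0, constant airspeed
    fov_angle: float                      # theta_i in (0, pi), camera field of view
    capacity: int                         # c_i, max number of onboard parcels
    max_mileage: float                    # f_i > 0, maximum flight mileage

    def ground_coverage_radius(self) -> float:
        """Radius of the circular footprint on the ground: z_i * tan(theta_i / 2)."""
        return self.location[2] * math.tan(self.fov_angle / 2.0)

@dataclass
class GroundUser:
    # tau_j: fixed ground location and parcel demand d_j
    location: Tuple[float, float, float]  # l_j = (x_j, y_j, 0)
    demand: int                           # d_j, number of required parcels
```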

Under centralized deployment, UAVs communicate via the central server, whereas distributed task assignment relies on local information exchanges within each subgroup, requiring UAVs to act as autonomous decision-makers with independent computing capabilities. The intermediary process between these two decisions is not taken into account.

A. BRC-Oriented Multi-UAV Deployment

The deployment decision divides a swarm of $N$ drones into $K+1$ subgroups denoted by $\{V_0,V_1,\ldots,V_K\}$, where $V_0$ represents the set of drones temporarily stationed at the base station, and $V_1,\ldots,V_K$ denote the subgroups deployed to $A_1,\ldots,A_K$. To ensure full coverage of ground users and respond to unpredictable events, we utilize the concept of BRC for a given number of $K$ subgroups within a swarm to achieve optimal deployment. This metric is calculated as follows:
$$B=\frac{1}{K}\sum_{k=1}^{K}\left(C_k-\bar{C}\right)^2\tag{1}$$
where $\bar{C}$ represents the average remaining capability across $K$ subgroups and $C_k>0$ denotes the remaining capability of the $k$th subgroup, where $k=1,2,\ldots,K$, such that
$$C_k=\sum_{i=1}^{N_k}\left(c_i-\hat{c}_i\right)\sum_{i=1}^{N_k}\sum_{j=1}^{M_k}\left(E_i-\varepsilon_j\right).\tag{2}$$
In (2), the first term on the right-hand side denotes a factor related to the remaining capacity level; $\hat{c}_i=c_i\bar{d}_k$ is an estimate of the capacity usage of $v_i$ with $\bar{d}_k=M_k^{-1}\sum_{j=1}^{M_k}d_j$. The second term on the right-hand side of (2) is the remaining energy of $V_k$. $E_i$ and $\varepsilon_j$, corresponding to the total energy of $v_i$ and the consumption required to visit $\tau_j$, are positively correlated with $f_i$ [30]. $N_k$ and $M_k$ are the number of drones and ground users, respectively, in the $k$th deployed region. Serving as an indicator to prevent resource shortages in addressing task requests, the remaining capability $C_k$ provides a conservative estimate of whether the $k$th subgroup can execute the tasks assigned to its task region. This ensures the availability of the $k$th subgroup and successful mission accomplishment.
By applying (1), differences among subgroups across distinct task regions can be quantified. A lower $B$ value indicates a more balanced distribution of remaining capability among subgroups, promoting balanced deployment and avoiding overload.
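To make the metric concrete, the following NumPy sketch computes (1) and (2) for toy inputs. It assumes (1) is the mean squared deviation of the subgroup capabilities and that the two right-hand-side terms of (2) combine multiplicatively; function and variable names are ours:

```python
import numpy as np

def remaining_capability(c, c_hat, E, eps):
    """Remaining capability C_k of one subgroup, following (2).

    c, c_hat : arrays of length N_k with capacity levels c_i and usage estimates c_hat_i.
    E        : array of length N_k with total energies E_i.
    eps      : array of length M_k with per-user energy costs eps_j.
    The capacity factor and the remaining-energy term are combined
    multiplicatively here, which is one reading of (2).
    """
    capacity_term = np.sum(c - c_hat)
    energy_term = np.sum(E[:, None] - eps[None, :])  # sum_i sum_j (E_i - eps_j)
    return capacity_term * energy_term

def brc(capabilities):
    """Balanced remaining capability B over K subgroups, following (1)."""
    C = np.asarray(capabilities, dtype=float)
    return np.mean((C - C.mean()) ** 2)

# Example with two toy subgroups:
C1 = remaining_capability(np.array([5, 5]), np.array([2, 2]),
                          np.array([100.0, 100.0]), np.array([10.0, 12.0, 8.0]))
C2 = remaining_capability(np.array([5]), np.array([1]),
                          np.array([120.0]), np.array([15.0, 9.0]))
print(brc([C1, C2]))
```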

B. Decentralized-Partially Observable Markov Decision Process (Dec-POMDP)-Based Task Assignment

The task assignment is conducted within each task region. Each $v_i, i=1,2,\ldots,N_k$ in $V_k$ functions as an autonomous decision-maker, tasked with obtaining and executing a sequence of ground user requests within $\theta_i$, represented as $\{\tau_{k_1},\tau_{k_2},\ldots,\tau_{k_m}\}$ in $A_k=\{\tau_1,\tau_2,\ldots,\tau_{M_k}\}$. This is modeled as a Dec-POMDP, denoted by $Z=\langle V,S,O,b,A,R,P,T,G,g,\gamma\rangle$, where drones are abstracted as a set of $N_k$ agents $V=\{1,2,\ldots,N_k\}$, and $\gamma$ is the discount factor. Detailed definitions are as follows.

State: We define a state in time slot $t$ as $s^t$, $s^t\in S$, where $s^t=\{l_i^t,l_j^t,s_i^t,\theta_i^t,c_i^t,f_i^t,d_j^t\}$ and $S$ denotes the set of states.

Action: We define the action chosen by $v_i$ in time slot $t$ as $a_i^t\in A_i$, where $A_i=\{a_i^{e},a_i^{s},a_i^{w},a_i^{n},a_i^{\mathrm{ref}}\}$. Each element represents a movement in one of four directions (east, south, west, and north) or a refusal to move, $a_i^{\mathrm{ref}}$. Each drone selects an action from the discretized action space $A=\{A_1,\ldots,A_{N_k}\}$ based on the current policy and the observed information.

Transition probability: The transition probability $P(s^{t+1}|s^t,a^t): s^t\times a^t\times s^{t+1}\to[0,1]$ represents the likelihood of transitioning from state $s^t$ to $s^{t+1}$ as a result of action $a^t$.

Observation space and function: An agent $i\in V$ can obtain a partial observation in time slot $t$ as $o_i^t=(l_j^t,d_j^t,X_j^t)$, $o_i^t\in O$, which contains the information of existing ground users within its field of view. $O$ denotes the observation function specifying the probability of the observation $o_i^t$ obtained by agent $i$ from the environment after taking an action $a_i^t$ to reach a state $s^{t+1}$ in time slot $t$.

Reward: Each $i\in V$ can obtain an individual reward $r_i^t$ according to the reward function $R_i(s^t,a^t)\in R$ in time slot $t$, calculated as follows:
$$R^t=\sum_{t=0}^{T-1}\left(\sum_{i=1}^{N_k}r_i^t-\sum_{j=1}^{M_k}r_{\mathrm{comm}}^{j,t}\right).\tag{3}$$

Belief space: A belief state $b^t\in b$ represents the probability of dynamic task requests occurring in time slot $t$. The global belief space $b$ is defined as a 2-D truncated discrete distribution over the task regions.

The belief space is updated using the Bayesian updating theorem, where the posterior belief at time $t+1$, denoted by $b^{t+1}$, is derived from the prior belief $b^t$, the action $a^t$, and the observation $o^{t+1}$. The update is expressed as follows:
$$b^{t+1}(s^{t+1})=\frac{p(o^{t+1}|s^{t+1},a^t)}{p(o^{t+1}|b^t,s^t,a^t)}\sum_{s^t\in S}p(s^{t+1}|s^t,a^t)\,b^t(s^t)\tag{4}$$
where the conditional probability $p\in\mathbb{R}$ is defined as follows:
$$p(o^{t+1}|b^t,s^t,a^t)=\sum_{s^{t+1}\in S}p(o^{t+1}|s^{t+1},a^t)\sum_{s^t\in S}p(s^{t+1}|s^t,a^t)\,b^t(s^t).\tag{5}$$

Goal space: A space $G$ represents the set of final and intermediate goals. The mapping function $g$ transforms the belief state $b^t$ and the system state $s^t$ into the goal space $G$, expressed as $g:(b^t,s^t)\to G$.
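The update in (4) and (5) is the standard discrete belief filter. A minimal NumPy sketch is given below; the tensor layout and names are assumptions of this illustration, not the paper's implementation:

```python
import numpy as np

def belief_update(b, T, O, action, obs):
    """One Bayesian belief update following (4) and (5).

    b      : prior belief over states, shape (S,), sums to 1.
    T      : transition tensor, T[a, s, s'] = p(s' | s, a).
    O      : observation tensor, O[a, s', o] = p(o | s', a).
    action : index of the executed action a^t.
    obs    : index of the received observation o^{t+1}.
    Returns the posterior belief b^{t+1} over states.
    """
    predicted = T[action].T @ b                     # sum_s p(s' | s, a) b(s)
    unnormalized = O[action][:, obs] * predicted    # numerator of (4)
    normalizer = unnormalized.sum()                 # p(o | b, a) from (5)
    if normalizer == 0.0:
        raise ValueError("Observation has zero probability under current belief.")
    return unnormalized / normalizer

# Tiny example with 2 states, 1 action, 2 observations:
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])          # T[0, s, s']
O = np.array([[[0.7, 0.3], [0.1, 0.9]]])          # O[0, s', o]
b0 = np.array([0.5, 0.5])
print(belief_update(b0, T, O, action=0, obs=1))
```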

C. Optimization-Based Problem Formulation

The global objective is achieved in the deployment process, in which each subgroup $V_k\subseteq V, k=1,2,\ldots,K$, achieves complete coverage of ground users within $A_k$. To address long-term mission execution and potential emergencies, the deployment subproblem, denoted as P1, is formulated by globally minimizing the BRC, expressed as follows:
$$\mathrm{P1}:\ \min B\tag{6}$$
s.t.
$$\begin{aligned}&\sum_{i=1}^{N_k}\left(c_i-\hat{c}_i\right)\geq 1&&\forall i\in\{1,2,\ldots,N_k\},\ k\in\{1,2,\ldots,K\}\\&E_i-\epsilon_j>0&&\forall i\in\{1,2,\ldots,N_k\},\ j\in\{1,2,\ldots,M_k\}\\&X_{i,k}\in\{0,1\}&&\forall i\in\{1,2,\ldots,N_k\},\ k\in\{1,2,\ldots,K\}\end{aligned}\tag{7}$$
where $X_{i,k}$ is a binary variable indicating the presence of $v_i$ in the $k$th subgroup. Specifically, $X_{i,k}=1$ signifies that $v_i$ is deployed to $A_k$, while $X_{i,k}=0$ indicates otherwise.
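As an illustration of how the constraints in (7) can be checked for a candidate deployment matrix $X$, a small sketch follows; the grouping of the energy constraint by region and all names are our own reading:

```python
import numpy as np

def is_feasible(X, c, c_hat, E, eps):
    """Check the constraints in (7) for a candidate deployment.

    X        : binary matrix, X[i, k] = 1 if UAV v_i is assigned to region A_k.
    c, c_hat : per-UAV capacity levels and estimated usages.
    E        : per-UAV total energies E_i.
    eps      : list of per-region arrays with energy costs eps_j of that region's users.
    """
    N, K = X.shape
    if not np.isin(X, (0, 1)).all():                 # X_{i,k} in {0, 1}
        return False
    for k in range(K):
        members = np.flatnonzero(X[:, k])
        # Residual capacity of the subgroup must be at least 1
        if np.sum(c[members] - c_hat[members]) < 1:
            return False
        # Every member must retain positive energy for every task in its region
        for i in members:
            if np.any(E[i] - eps[k] <= 0):
                return False
    return True

# Toy example: 3 UAVs, 2 regions
X = np.array([[1, 0], [0, 1], [0, 1]])
print(is_feasible(X, c=np.array([5, 4, 4]), c_hat=np.array([2.0, 1.5, 1.5]),
                  E=np.array([100.0, 90.0, 80.0]),
                  eps=[np.array([10.0, 20.0]), np.array([15.0])]))
```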

The task assignment subproblem within each $A_k, k=1,2,\ldots,K$, denoted as P2, is formulated based on the optimal Bellman equation [31]. The discounted optimal action-value function $Q^*$ for a finite horizon Dec-POMDP is expressed as follows:
$$\mathrm{P2}:\ Q^*(b^t,s^t,a^t)=\max_{s^{t+1}\in S}P(s^{t+1}|s^t,a^t)\left[r(s^{t+1}|s^t,a^t)+\gamma V^*(b^{t+1},s^{t+1})\right].\tag{8}$$
Here, $r(s^{t+1}|s^t,a^t)$ represents the immediate reward obtained from the transition $P(s^{t+1}|s^t,a^t)$. This transition is partially influenced by $O$ and fully determined by $S$. The optimal value function $V^*$ represents the maximum expected cumulative reward achievable from a given belief state $b^t$ and state $s^t$, assuming the optimal policy is followed thereafter.

SECTION III.

Proposed Framework and Algorithms

A. Overview of the Decision-Making Framework

We implement the two-phase decision-making process through a hierarchical framework designed to optimize the two subproblems in a coordinated manner. As illustrated in Fig. 1(a), the upper layer deploys $N$ drones across $K+1$ subgroups to $K$ task regions. This is computed offline on a central server, where the SADCK-Medoid algorithm is employed to solve problem P1 [upper box in Fig. 1(b)]. The lower layer addresses the distributed assignment of ground user sequences to each subgroup of agents. This is accomplished through online training using the proposed GOBS-MARL algorithm. By leveraging updated belief distributions, the system effectively captures environmental dynamics, providing hindsight experience with enhanced learning efficiency [lower box in Fig. 1(b)].

To adapt to unpredictable events, coordination is facilitated through the BRC determination mechanism. As shown in Fig. 1(a), the central server collects current assignment results and updated requests, calculates the BRC value, and determines the need for replanning. This process is guided by the BRC threshold, $B_{\mathrm{thres}}\in\mathbb{R}$, which is quantified based on the system's status and mission requirements. Specifically, if $B\leq B_{\mathrm{thres}}$, the current deployment policy can accommodate updated requests without redeployment, allowing the central server to function as a continuous monitor. Conversely, if $B_{\mathrm{thres}}$ is exceeded, a warning is triggered, signaling the need for replanning. This mechanism ensures real-time responsiveness and the generation of on-demand replanning policies.
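A minimal sketch of this monitoring and triggering logic is shown below; `compute_brc` and `replan` are placeholder callables standing in for (1) and the SADCK-Medoid solver, respectively:

```python
def monitor_step(current_assignments, new_requests, B_thres, compute_brc, replan):
    """One monitoring cycle of the central server, as described above.

    compute_brc : callable mapping (assignments, requests) to the current BRC value B.
    replan      : callable producing a new deployment (e.g., via SADCK-Medoid).
    Both callables and the argument names are placeholders for this sketch.
    """
    B = compute_brc(current_assignments, new_requests)
    if B <= B_thres:
        # Current deployment still absorbs the updated requests; keep monitoring.
        return current_assignments, False
    # Threshold exceeded: trigger a warning and generate a replanning decision.
    return replan(current_assignments, new_requests), True

# Example with stub callables:
plan, replanned = monitor_step({}, [], B_thres=30.0,
                               compute_brc=lambda a, r: 27.3,
                               replan=lambda a, r: {"redeployed": True})
print(replanned)  # False, since 27.3 <= 30.0
```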

B. SADCK-Medoid Algorithm

As shown in Fig. 1(b), the K-medoids algorithm [27] classifies numerous ground users, while SA [28] allocates UAVs into task-specific subgroups. By executing these two processes in parallel, adaptability to dynamic changes is greatly enhanced. Given the set $L=\{l_1,l_2,\ldots,l_M\}$, the objective is to minimize the within-cluster sum of squares, defined as
$$\arg\min_{\mathrm{Clus}}\sum_{k=1}^{K}\sum_{l\in\mathrm{Clus}_k}\left\|l-\mu_k\right\|^2\tag{9}$$
$$\mu_k=\frac{1}{|\mathrm{Clus}_k|}\sum_{l\in\mathrm{Clus}_k}l\tag{10}$$
where $\mu_k$ denotes the centroid of cluster $\mathrm{Clus}_k$ in Euclidean space, $|\mathrm{Clus}_k|$ denotes the size of cluster $k$, and $\|\cdot\|$ is the standard $L_2$-norm. The density of the $k$th task region is defined as the number of drones deployed per square unit
$$\mathrm{Dens}_k=\frac{N_k}{\mathrm{Size}_k}\tag{11}$$
where $\mathrm{Size}_k\in\mathbb{R}$ denotes the actual area of the $k$th region, and $N_k$ is the number of UAVs deployed in the $k$th region. Since the convergence of K-medoids cannot be guaranteed due to the randomness in centroid initialization, only the first centroid is selected randomly, while the remaining centroids are chosen based on their distance to known centroids
$$\mathrm{Dist}(l_j)=\arg\min_k\left\|l_j-\mu_k\right\|.\tag{12}$$


Targets with higher $\mathrm{Dist}(l_j)$ values are more likely to be selected as centroids. The implementation steps of the SADCK-Medoid algorithm are detailed in Algorithm 1.

Algorithm 1: SADCK-Medoid.

1: for each iteration $i=1,2,\ldots,I_1$ do
2:   Randomly initialize the first centroid $\mu_1$;
3:   Select the remaining $K-1$ centroids based on (12);
4:   Initialize all clusters as empty: $\mathrm{Clus}_k=\emptyset$ for all $k$;
5:   for $j=1,2,\ldots,M$ do
6:     Assign location $l_j$ to the nearest centroid;
7:     Calculate cluster density $\mathrm{Dens}_k$ via (11);
8:     if $\mathrm{Dens}_k$ exceeds the density constraint then
9:       Reject and reassign to the next nearest cluster;
10:    else
11:      Update the centroid $\mu_k$ via (10);
12:    end if
13:  end for
14:  Partition UAVs into $K$ subgroups arbitrarily;
15:  for $i=1,2,\ldots,N$ do
16:    Calculate $B_{\mathrm{new}}$ according to (1);
17:    for $k=1,2,\ldots,K$ do
18:      Move UAV $v_i$ to $\mathrm{Clus}_k$;
19:      if $B_{\mathrm{new}}<B_{\mathrm{now}}$ or $\exp(-\Delta B/B_{\max})<\mathrm{rand}(0,1)$ then
20:        Accept the movement;
21:      else
22:        Reject the movement.
23:      end if
24:    end for
25:  end for
26: end for
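The following compressed Python sketch mirrors the two parallel processes of Algorithm 1: distance-based centroid seeding with nearest-centroid assignment (the density check of lines 7-9 is omitted), and a simulated-annealing search over UAV-to-subgroup moves. The BRC is replaced by a toy balance proxy in the demo, so this is illustrative rather than a faithful re-implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_centroids(L, K):
    """Distance-based seeding in the spirit of (12): first centroid random,
    later centroids favour points far from those already chosen."""
    centroids = [L[rng.integers(len(L))]]
    for _ in range(K - 1):
        d = np.min(np.linalg.norm(L[:, None, :] - np.array(centroids)[None, :, :], axis=2), axis=1)
        centroids.append(L[np.argmax(d)])
    return np.array(centroids)

def cluster_users(L, K):
    """One assignment pass: each user goes to its nearest centroid."""
    centroids = init_centroids(L, K)
    return np.argmin(np.linalg.norm(L[:, None, :] - centroids[None, :, :], axis=2), axis=1)

def sa_partition_uavs(N, K, brc_of, iters=500, temp=1.0):
    """Simulated-annealing style search over UAV-to-subgroup assignments.

    brc_of : callable mapping an assignment array (length N, values in 0..K-1)
             to a BRC value B; supplied by the caller.
    """
    assign = rng.integers(0, K, size=N)
    best, best_B = assign.copy(), brc_of(assign)
    for _ in range(iters):
        cand = assign.copy()
        cand[rng.integers(N)] = rng.integers(K)        # move one UAV to another subgroup
        dB = brc_of(cand) - brc_of(assign)
        if dB < 0 or rng.random() < np.exp(-dB / max(temp, 1e-9)):
            assign = cand                              # accept improving or occasional worsening moves
        if brc_of(assign) < best_B:
            best, best_B = assign.copy(), brc_of(assign)
        temp *= 0.99                                   # cool down
    return best, best_B

# Toy run: 20 users, 3 regions, 6 UAVs, BRC replaced by a simple balance proxy.
L = rng.random((20, 2)) * 50.0
labels = cluster_users(L, K=3)
counts_per_region = np.bincount(labels, minlength=3)
proxy_brc = lambda a: float(np.var(np.bincount(a, minlength=3) - counts_per_region / 3.0))
print(sa_partition_uavs(N=6, K=3, brc_of=proxy_brc))
```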

C. Goal-Oriented Belief Space MARL

The proposed GOBS-MARL algorithm tackles real-time task assignment by integrating belief-oriented hindsight experience replay (HER) [32], multiagent deep Q-learning (MADQL), and a value decomposition network (VDN). The process encompasses initialization, action execution, belief updating, experience replay and training, and optimization, as detailed in Algorithm 2.

Belief-oriented HER: Actions aligned with the belief distribution and moving in the correct direction are assigned positive rewards. At each time step, agent experiences, denoted as $\mathrm{Tr}^t=\{b^t,s^t,a^t,r^t,b^{t+1},s^{t+1}\}$, are stored in a dataset $\mathrm{Tr}=\{\mathrm{Tr}^1,\ldots,\mathrm{Tr}^M\}$, which aggregates experiences across multiple episodes into a replay memory. These experiences are sampled and updated as $\mathrm{Tr}_{\mathrm{new}}^t$, derived from the mapping function $g$. The updated reward $r_{\mathrm{new}}^t$ exceeds $r^t$ when UAVs move in the correct direction. To incentivize such actions, an additional reward of 20% of the average rewards is introduced.
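A possible reading of this belief-oriented relabeling is sketched below; the predicate standing in for the goal-space mapping $g$ and the transition format are assumptions:

```python
def relabel_with_belief(replay_buffer, belief_direction_ok, bonus_ratio=0.2):
    """Hindsight-style reward relabeling guided by the belief distribution.

    replay_buffer       : list of transition dicts with keys 'b', 's', 'a', 'r', 'b_next', 's_next'.
    belief_direction_ok : callable(transition) -> bool, True if the action moved the UAV
                          toward high-belief (likely-request) areas. This predicate stands
                          in for the paper's goal-space mapping g and is an assumption here.
    bonus_ratio         : extra reward as a fraction of the buffer's average reward (20% above).
    """
    if not replay_buffer:
        return []
    avg_reward = sum(tr["r"] for tr in replay_buffer) / len(replay_buffer)
    relabeled = []
    for tr in replay_buffer:
        new_tr = dict(tr)
        if belief_direction_ok(tr):
            new_tr["r"] = tr["r"] + bonus_ratio * avg_reward  # r_new exceeds r for aligned moves
        relabeled.append(new_tr)
    return relabeled

# Example: reward a transition whenever the toy predicate fires.
buf = [{"b": None, "s": 0, "a": "e", "r": 1.0, "b_next": None, "s_next": 1},
       {"b": None, "s": 1, "a": "ref", "r": 1.0, "b_next": None, "s_next": 1}]
print(relabel_with_belief(buf, lambda tr: tr["a"] != "ref"))
```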

MADQL: The action-value function based on a policy $\pi$ is defined as
$$Q_\pi(b^t,s^t,a^t)=\mathbb{E}\left[U^t\,|\,b^t,s^t,a^t\right]\tag{13}$$
where $\mathbb{E}$ represents the expected cumulative reward, starting from state $s^t$ and belief $b^t$, by taking action $a^t$ under policy $\pi$. The total discounted reward $U^t$ from time $t$ is
$$U^t=\sum_{t'=0}^{T-t}\gamma^{t'}r(s^{t'+1}|s^{t'},a^{t'})\tag{14}$$
where $\gamma\in[0,1]$ and $r(s^{t+1}|s^t,a^t)$ is the immediate reward obtained by transitioning from $s^t$ to $s^{t+1}$ via $a^t$.

The optimal value function $V^*$ is expressed as
$$V^*(b^t,s^t)=\max_\pi\mathbb{E}\left[U^t\,|\,b^t,s^t\right].\tag{15}$$
Using samples from the environment, a temporal difference (TD) method estimates the current value function via Monte Carlo approximation
$$Q(b^t,s^t,a^t)\leftarrow r^t+\gamma\max_\pi Q(b^{t+1},s^{t+1},a^{t+1}).\tag{16}$$
This formulation enables agents to iteratively improve their policies by maximizing expected rewards over time, making it well-suited for dynamic environments.
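A minimal PyTorch sketch of the TD target in (16), with the (belief, state) encoding and network shape assumed for illustration:

```python
import torch

def td_targets(q_target_net, rewards, next_inputs, gamma=0.95):
    """TD targets r^t + gamma * max_a' Q(b^{t+1}, s^{t+1}, a'), following (16).

    q_target_net : a torch module mapping a (belief, state) feature batch to Q-values
                   of shape (batch, num_actions); the feature encoding is an assumption.
    rewards      : tensor of shape (batch,).
    next_inputs  : tensor of shape (batch, feature_dim) encoding (b^{t+1}, s^{t+1}).
    """
    with torch.no_grad():
        next_q = q_target_net(next_inputs)           # (batch, num_actions)
        return rewards + gamma * next_q.max(dim=1).values

# Example with a throwaway linear Q-network over 5 actions (E, S, W, N, refuse):
q_net = torch.nn.Linear(8, 5)
y = td_targets(q_net, rewards=torch.ones(4), next_inputs=torch.randn(4, 8))
loss = torch.nn.functional.mse_loss(q_net(torch.randn(4, 8)).max(dim=1).values, y)
print(float(loss))
```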

VDN: Assuming full cooperation among agents, all agents share the same reward function. However, joint actions $A^t$, states $s^t$, and the global belief space $b^t$ are essential for evaluating the global Q-function. To address instability, a VDN is introduced, which decomposes the global Q-function into individual agent Q-functions rather than learning it directly. The decomposition is defined as
$$Q_{\mathrm{total}}(b^t,s^t,a^t)=f\left(Q_1(b^t,s^t,a^t),Q_2(b^t,s^t,a^t),\ldots,Q_{n_v}(b^t,s^t,a^t)\right)\tag{17}$$
subject to
$$\frac{\partial Q_{\mathrm{total}}}{\partial Q_i}\geq 0\quad\forall i.$$
This restriction ensures that the joint local optima contribute to the global optimum. In assignment problems, UAVs are equally weighted, and the global reward is defined as the sum of individual rewards. Consequently, a linear decomposition function $f(\cdot)$ is well-suited for this purpose.
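With a linear $f(\cdot)$, the VDN mixing reduces to summing the per-agent Q-values of the chosen actions, as in this short PyTorch sketch (tensor shapes are assumptions):

```python
import torch

def vdn_total_q(per_agent_q, actions):
    """Additive VDN mixing: Q_total = sum_i Q_i(b, s, a_i), the linear f(.) in (17).

    per_agent_q : list of tensors, each (batch, num_actions), one per agent.
    actions     : tensor (batch, num_agents) with the chosen action index of each agent.
    The monotonicity constraint dQ_total/dQ_i >= 0 holds trivially because the
    mixing is an unweighted sum.
    """
    chosen = [q.gather(1, actions[:, i:i + 1]).squeeze(1)       # Q_i of the taken action
              for i, q in enumerate(per_agent_q)]
    return torch.stack(chosen, dim=0).sum(dim=0)                # (batch,)

# Example: 3 agents, 5 actions, batch of 2.
qs = [torch.randn(2, 5) for _ in range(3)]
acts = torch.randint(0, 5, (2, 3))
print(vdn_total_q(qs, acts))
```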

Algorithm 2: GOBS-MARL.

1: Initialize the environment, Q-functions, target functions, replay buffer $D$, and parameters.
2: for each episode $i=1,2,\ldots,I_2$ do
3:   for each timestep $t=1,2,\ldots,T$ do
4:     Select individual actions for each agent using the current Q-function and a greedy policy as in (13);
5:     Execute joint actions $A^t=\{a^{t,1},a^{t,2},\ldots,a^{t,N_k}\}$;
6:     Observe $R^t$ and $o^{t+1}$ based on (3);
7:     Update belief space $b^{t+1}$ using (4) and (5);
8:     Store transition $\mathrm{Tr}^t$ into $D$;
9:     if replay buffer $D$ contains sufficient experience then
10:      Sample minibatch $\mathrm{Tr}$ based on $g$;
11:      for each sample $m=1,2,\ldots,M$ do
12:        Compute target value $y_m$ based on (16);
13:      end for
14:      Compute the total TD loss across the minibatch;
15:      Update network parameters $\{\theta_1,\theta_2,\ldots,\theta_{n_v}\}$;
16:    end if
17:  end for
18: end for
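A skeleton of the loop in Algorithm 2 might look as follows; the `env` and agent interfaces are invented for this sketch, and goal-oriented sampling is simplified to uniform sampling:

```python
import random
from collections import deque

def train_gobs_marl(env, agents, episodes=500, steps=100, batch_size=32):
    """Skeleton of the training loop in Algorithm 2.

    env    : object exposing reset(), step(joint_actions) -> (obs, reward, done) and
             update_belief(obs); agents expose act(belief, obs), learn(batch), and
             sync_targets(). All of these interfaces are assumptions for this sketch.
    """
    replay = deque(maxlen=100_000)
    for episode in range(episodes):
        belief, obs = env.reset()
        for t in range(steps):
            joint_action = [ag.act(belief, obs[i]) for i, ag in enumerate(agents)]
            next_obs, reward, done = env.step(joint_action)        # observe R^t, o^{t+1}
            next_belief = env.update_belief(next_obs)              # Bayesian update, (4)-(5)
            replay.append((belief, obs, joint_action, reward, next_belief, next_obs))
            if len(replay) >= batch_size:
                batch = random.sample(list(replay), batch_size)    # goal-oriented sampling simplified
                for ag in agents:
                    ag.learn(batch)                                # TD loss as in (16)
            belief, obs = next_belief, next_obs
            if done:
                break
        if episode % 10 == 0:
            for ag in agents:
                ag.sync_targets()                                  # refresh target networks
```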

D. Convergence Analysis

Theorem 1:

The action-value function $Q(b^t,s^t,a^t)$ converges to the optimal action-value function $Q^*(b^t,s^t,a^t)$ with probability 1.



Proof:

The action-value function is updated using the learning rate $\beta\in(0,1)$ as
$$Q(b^{t+1},s^{t+1},a^{t+1};\theta)=Q(b^t,s^t,a^t;\theta)+\beta\left[r^t+\gamma\max_{a^{t+1}}Q(b^{t+1},s^{t+1},a^{t+1};\hat{\theta})-Q(b^t,s^t,a^t;\theta)\right].\tag{18}$$
Rewriting this as
$$Q(b^{t+1},s^{t+1},a^{t+1};\theta)=(1-\beta)Q(b^t,s^t,a^t;\theta)+\beta\left[r^t+\gamma\max_{a^{t+1}}Q(b^{t+1},s^{t+1},a^{t+1};\hat{\theta})\right].\tag{19}$$
Define the error $\Delta^t=Q^*(b,s,a;\theta)-Q(b^t,s^t,a^t;\theta)$. The update becomes
$$\Delta^{t+1}=(1-\beta)\Delta^t+\beta\left[Q^*(b,s,a;\theta)-\left(r^t+\gamma\max_{a^{t+1}}Q(b^{t+1},s^{t+1},a^{t+1};\hat{\theta})\right)\right].\tag{20}$$
Define
$$F^t(b^t,s^t,a^t)=Q^*(b,s,a;\theta)-\left(r^t+\gamma\max_{a^{t+1}}Q(b^{t+1},s^{t+1},a^{t+1};\hat{\theta})\right)\tag{21}$$
where $F^t$ is a contraction mapping under the maximum norm. According to the theorems in [33], the iterative process $\Delta^{t+1}=(1-\beta)\Delta^t+\beta F^t(b^t,s^t,a^t)$ converges to zero if the four assumptions in [33] are satisfied.



The state space $S$ is finite, which holds true in this problem formulation. The learning rate $\beta\in(0,1)$ is constant, satisfying the required conditions. The conditional expectation of $F^t(b^t,s^t,a^t)$ given the history $H$ is bounded
$$\max_{a^t}\mathbb{E}\left\{F^t(b^t,s^t,a^t)\,\middle|\,H,\beta\right\}\leq\gamma\max_{a^t}\Delta^t.\tag{22}$$
Here, the expectation bounds the difference between the optimal and current action-value functions. The variance of $F^t(b^t,s^t,a^t)$ is bounded
$$\mathrm{Var}\left\{F^t(b^t,s^t,a^t)\,\middle|\,H,\beta\right\}\leq C\left(1+\Delta^t\right)^2\tag{23}$$
where $C<\infty$ is a constant. This condition is satisfied if the variance of $r^t$ is bounded, and $F^t$ depends on $Q(b^t,s^t,a^t;\theta)$ at most linearly.

Since all assumptions hold, $\Delta^t\to 0$, and $Q(b^t,s^t,a^t)$ converges to $Q^*(b^t,s^t,a^t)$ with probability 1.

This completes the proof of Theorem 1.

SECTION IV.

Simulation Results and Discussion

A. Dataset and Simulation Setup

We evaluated the proposed methods through simulations using real-world datasets. The distribution of ground users was generated based on the Emergency 911 Calls dataset [34], which contains 664 000 records of emergency activities in the USA over five years. Each record includes eight key attributes detailing locations, times, types, and requirements of specific events. We extracted a subset of 152 points within a 24-h period, representing emergency events in a 50 × 50 km$^2$ residential area in Pennsylvania, USA. The multi-UAV system parameters were derived from the DJI FlyCart 30, an 8-propeller cargo quadcopter designed for heavy-duty deliveries, with a maximum load capacity of 30 kg, a range of 28 km, and a flight speed of 20 m/s. These parameters ensure alignment with real-world deployment scenarios.

We adopted a centralized training and decentralized execution scheme. During training, agent experiences were collected in a shared replay buffer, sampled randomly, and updated uniformly using the Bayesian theorem. In the execution phase, each agent independently made decisions using its own neural network, comprising four hidden layers with 128, 128, 64, and 64 neurons, respectively. The model was trained over 500 episodes, with each episode consisting of 100 execution steps.
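A per-agent network matching the reported hidden-layer sizes could look as follows in PyTorch; the input dimension, activation, and output head are assumptions of this sketch:

```python
import torch.nn as nn

class AgentQNetwork(nn.Module):
    """Per-agent Q-network matching the layer sizes reported above
    (four hidden layers with 128, 128, 64, and 64 neurons). The input
    dimension and activation choice are assumptions of this sketch."""

    def __init__(self, input_dim: int, num_actions: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, num_actions),   # one Q-value per discrete action
        )

    def forward(self, x):
        return self.net(x)
```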

For upper layer optimization, we evaluated convergence using the Silhouette score, Calinski–Harabasz score, and Davies–Bouldin index (Dbi). The global objective was to minimize the BRC, with an ideal threshold of $B_{\mathrm{thres}}=30$, determined from statistical data. We compared the proposed algorithm against two types of baselines: 1) individual components of the proposed method (K-medoids [27] and the SA algorithm [28]) and 2) SOTA clustering methods (K-means [29], PGMeans [35], sub-KMeans [36], and DipMeans [37]). Algorithm complexity and running time were recorded for performance comparisons. For lower layer decisions, we used average rewards, computational complexity (measured in FLOPs and inference time), and running time as evaluation metrics. The multiagent DQN (MADQN) algorithm [38] served as a baseline to demonstrate the enhancements of our modifications. In addition, two intermediate goal-oriented rules (GOBS-A and GOBS-B) were introduced to validate the effectiveness of the designed reward mechanism. We also included advanced baselines, MAA2C [39] and VDPPO [40], representing actor-critic and policy-based methods, respectively. These benchmarks cover value-based, actor-critic, and policy-based approaches, ensuring a comprehensive comparison. All methods shared the same environmental parameters for fairness.

The proposed approach was implemented using PyTorch in Python on a system with an AMD Ryzen 7 5800H CPU, 3060 Laptop GPU, and Windows 10. Table I provides the empirical parameter settings.

TABLE I Parameter Setting

B. Performance Analysis of Upper Layer Deployment

To evaluate the effectiveness of the proposed algorithm, we first analyzed its convergence properties using indices with 95% confidence intervals. Fig. 2(a) shows the Silhouette score, which measures cluster dissimilarity, converging at 0.5. Fig. 2(b) illustrates the Calinski–Harabasz score, representing the ratio of between-cluster to within-cluster scatter, peaking at approximately 250. Fig. 2(c) presents the Dbi curve, a measure of cluster compactness, stabilizing at 0.65 after around 120 iterations. These results demonstrate the algorithm's robust convergence and effectiveness in achieving well-balanced and compact clusters for multi-UAV deployment.

Fig. 2. Convergence properties of SADCK-Medoid algorithm. (a) Silhouette score. (b) Calinski–Harabasz score. (c) Davies–Bouldin score.

Second, we analyzed deployment results in a dynamic environment with 152 ground users (circles). The 24-h period was divided into three equal stages, with the central server updating replanning decisions at the end of each stage based on new information. Fig. 3 illustrates clustering outcomes for ground users and emergency events after a two-stage updating process (inverted triangles and regular triangles, respectively). Ground users were divided into five task regions, each represented by a distinct color. Initial deployment results, shown in Fig. 4(a), allocated 4, 6, 3, 7, and 6 drones to regions with 9, 17, 9, 17, and 12 users, respectively, leaving fourteen drones at the base. After the first emergency, user numbers increased in all regions except TR1, but deployment remained unchanged. Following the second emergency, significant user increases prompted redeployment, with 6, 9, 5, 10, and 7 drones allocated per region, and reinforcement drones dispatched from the base. Fig. 4(b) shows the evolution of the BRC across three stages. In the first stage, the BRC converged to 19.11 using SADCK-Medoid. In the second stage, the BRC rose to 27.27 but remained below the threshold, avoiding redeployment. In the third stage, the BRC exceeded the threshold but reconverged to an acceptable level with algorithmic assistance. These results demonstrate the algorithm's ability to adaptively manage multi-UAV deployment, ensuring balanced dispatch and readiness for emerging demands.

Fig. 3. Clustering results of ground users in dynamic environments.

Fig. 4. Deployment results and BRC evolution curves in dynamic environments. (a) Number of deployed drones and task changes in three stages. (b) BRC evolution curves in different stages.

To further validate the proposed approach, we conducted an ablation study and compared SADCK-Medoid with constrained K-medoids [27], SA [28], and K-means [29] in minimizing the global BRC under varying numbers of ground users. As shown in Fig. 5, SADCK-Medoid consistently outperformed its individual components and K-means, achieving impressive results with 150 ground users by minimizing the BRC value within 250 iterations [see Fig. 5(a)]. Although additional iterations were required for larger numbers of users [see Fig. 5(b) and (c)], it maintained competitive global convergence performance.

Fig. 5. Comparative curves of BRC evolution across varying numbers of ground users. (a) Total number of 150 ground users. (b) Total number of 200 ground users. (c) Total number of 250 ground users.

The comparison, summarized in Table II, highlights the proposed algorithm's balance between quality and efficiency. With a time complexity of $O(2NI(M+1)MN_c)$, its structure reflects additional computational steps for dynamic partitioning and redundancy balancing. Despite this added complexity, the algorithm achieves superior clustering quality, evidenced by the lowest Dbi value of 0.605, outperforming baseline methods. While its running time of 523 s is moderate (faster than SA and sub-KMeans but slower than K-medoids, DipMeans, PGMeans, and K-means), it demonstrates an effective tradeoff, offering high-quality results for complex scenarios.
II 中的比较突出了所提出算法在质量和效率之间的平衡。具有时间复杂度 O(2NI(M+1)MNc) ,其结构反映了动态分区和冗余平衡的额外计算步骤。尽管增加了复杂性,该算法实现了卓越的聚类质量,以 0.605 的最低 Dbi 值为证,优于基线方法。虽然其运行时间为 523 秒,属于中等水平——比 SA 和亚 KMeans 快,但比 K-medoids、DipMeans、PGMeans 和 K-means 慢,但它展示了有效的权衡,为复杂场景提供了高质量的结果。

TABLE II Performance Comparison in Deployment

C. Performance Analysis of Lower Layer Assignment

Task assignment was conducted for each subgroup within their respective task regions based on the upper layer deployment outcomes, utilizing the proposed GOBS-MARL algorithm. Fig. 6 presents the task assignment results across five subfigures, each representing a specific task region. The geographical distributions align with the deployment results, ensuring comprehensive ground user coverage. Initially, drones were strategically positioned near existing targets to maximize the likelihood of detecting unpredictable events. As new ground users emerged, drones executed tasks guided by the global BRC objective and their capacity (indicated by light-colored circular coverage areas). Redeployment was activated when necessary, employing additional drones to ensure updated ground users were adequately covered (dark circular coverage areas). Evidently, the well-trained GOBS-MARL effectively ensures comprehensive task coverage while adapting to dynamic ground user updates.

Fig. 6. Coverage of ground users in each task region based on GOBS-MARL algorithm. (a) TR1. (b) TR2. (c) TR3. (d) TR4. (e) TR5.

To evaluate the effectiveness of the assignment policies learned by GOBS-MARL, we compared its training rewards with those of MADQN [38] across each task region. Using the third stage as an example, Fig. 7 illustrates the learning curve comparisons for each task region. Fig. 7(b) highlights GOBS-MARL's significant early-phase efficiency improvement. In Fig. 7(a), (d), and (e), both algorithms show sharp initial increases in average rewards, followed by steady increments and smooth progress after approximately 100 episodes. Notably, GOBS-MARL consistently achieves the highest average reward in these regions. Although the performance appears similar in Fig. 7(c), the cumulative rewards across all task regions, as shown in Fig. 8, clearly demonstrate the superior efficiency and training performance of the proposed GOBS-MARL algorithm. As shown, it exhibits an earlier upward trend, achieving higher rewards after approximately 20 episodes and maintaining stability thereafter. This result highlights the effectiveness of the goal-oriented belief space updating mechanism in improving learning efficiency.

Fig. 7. Comparison of the learning curves of GOBS-MARL and MADQN for each task region. (a) TR1. (b) TR2. (c) TR3. (d) TR4. (e) TR5.

Fig. 8. Comparison of the sum of rewards for all task regions.

Table III summarizes the statistical results of the evaluated algorithms. GOBS-MARL achieves the highest average reward of 12 290, surpassing all compared methods. It demonstrates efficiency with a shorter inference time of 0.286 s and fewer FLOPs (28) compared to MAA2C and VDPPO. While MADQN shows lower computational complexity, it suffers from significantly higher training (3823) and testing (186) times. In contrast, GOBS-MARL significantly reduces these times to 3779 and 184, respectively, highlighting its suitability for real-time decision-making. In addition, comparisons among GOBS-MARL, GOBS-A, and GOBS-B reveal that the belief-based trend learning mechanism outperforms fixed intermediate goals and penalty-based reward scheme. In general, GOBS-MARL demonstrates notable advantages in both effectiveness and efficiency, with only a minimal increase in inference time. This performance is largely attributed to the Bayesian updating mechanism and goal space mapping, which dynamically refine beliefs and optimize goals, mitigating sparse rewards and enhancing training efficiency.
III 总结了评估算法的统计结果。GOBS-MARL 实现了 12,290 的最高平均奖励,超越了所有比较方法。它展示了效率,具有更短的推理时间(0.286 秒)和更少的 FLOPs(28),与 MAA2C 和 VDPPO 相比。虽然 MADQN 的计算复杂度较低,但它受到训练时间(3823)和测试时间(186)显著增加的影响。相比之下,GOBS-MARL 将这两个时间分别显著减少到 3779 和 184,突显了其在实时决策中的适用性。此外,GOBS-MARL、GOBS-A 和 GOBS-B 之间的比较表明,基于信念的趋势学习机制优于固定中间目标和基于惩罚的奖励方案。总的来说,GOBS-MARL 在有效性和效率方面都表现出显著优势,推理时间仅略有增加。这一性能主要归因于贝叶斯更新机制和目标空间映射,它们动态地完善信念并优化目标,缓解稀疏奖励问题并提高训练效率。

TABLE III Performance Comparison in Assignment

D. Discussion

Advantages: A key strength of the framework is its integration of offline deployment pregeneration and real-time assignment decisions through online training. This approach not only supports long-term services for predefined tasks but also enables UAVs to efficiently adapt to unforeseen circumstances with dynamic demands. The coordination mechanism further enhances operational flexibility, facilitating seamless transitions between tasks.

Limitations: The determination of the BRC threshold relies on extensive simulations and requires incorporating additional UAVs and task-specific details, which can be time-consuming. In addition, the proposed method does not fully account for UAV failures, environmental uncertainties, or communication constraints that may arise during task execution.

Practical applications and real-world deployment: A typical application of the proposed framework is supply delivery. For validation, real-world experiments can be conducted in an indoor scenario, where UAVs are required to visit scattered ground users. As illustrated in Fig. 9, a ground station acts as a central server for offline computation, model training, and data processing. It transmits control commands and global planning decisions via the robot operating system (ROS). Each quadrotor, serving as the experiment platform, is equipped with a Pixhawk4 flight controller, onboard computing units (e.g., Raspberry Pi 4 with 8 GB RAM), visual sensors (e.g., RealSense Depth Camera D435i for environmental perception), and communication modules (e.g., wireless local area networks). Ground users are represented by mobile robots, offering mobility and flexibility for dynamic deployment. Each robot is assigned a unique task demand value, denoted by numerical labels detectable by cameras. Based on the described environment and hardware setup, real-world deployment can be summarized as follows.

Fig. 9. Snapshot of the test environment for real-world evaluation.

Step 1: Collect and process environmental data, perform offline computation, and pretrain the GOBS-MARL model.

Step 2: Generate deployment decisions, transmit the results, and deploy the trained network to each quadrotor.

Step 3: Gather environmental data, update the local belief space, and generate real-time task assignment decisions.

Step 4: Modify the distributions and quantities of mobile robots, detect dynamic requests via onboard cameras, relay the information to the ground station, and determine whether to trigger replanning decisions.

Step 5: Conduct performance evaluation and implement experimental improvements.

Environmental data are obtained using a LiDAR sensor and processed within the ROS architecture. Performance metrics include, but are not limited to, computational efficiency (e.g., average decision-making time per quadrotor), power consumption (e.g., battery life under varying workloads), and adaptability to dynamic environments (e.g., response time to dynamic requests). These metrics provide valuable insights into the algorithm's real-world applicability and robustness.

SECTION V.

Conclusion

The proposed framework integrates and coordinates deployment and assignment subproblems for multi-UAV mission replanning. Unlike existing approaches, our framework applies high-level management principles to long-term missions in geo-distributed environments, combining offline computation for global objectives with online training for real-time decision-making. This design effectively addresses current operational demands while mitigating unforeseen challenges. The SADCK-Medoid algorithm demonstrates superior convergence and global search capabilities, while the GOBS-MARL algorithm enhances learning efficiency for real-time task assignment. Experimental results validate the framework's effectiveness, showing significant performance improvements over SOTA methods in dynamic and complex environments.

Future research will focus on conducting theoretical analyses to refine parameter selection for the algorithms. Further testing on a broader range of datasets is also necessary. In practical applications, future work could investigate UAV dynamics.
