
Policy Optimization Method for Multi-UAV Cooperative Maritime Navigation Based on Causal Influence Detection

Liu Wen-Zhang, Lu Jian-Hua, Ren Lu, Sun Chang-Yin

Citation: Liu Wen-Zhang, Lu Jian-Hua, Ren Lu, Sun Chang-Yin. Policy optimization method for multi-UAV cooperative maritime navigation based on causal influence detection. Acta Automatica Sinica, xxxx, xx(x): x−xx doi: 10.16383/j.aas.c250691

doi: 10.16383/j.aas.c250691 cstr: 32138.14.j.aas.c250691


Funds: Supported by National Natural Science Foundation of China (62303009, 62495083, and 62236002) and Youth Research Project of Anhui Provincial Department of Education (2025AHGXZK40374)
More Information
    Author Bio:

    LIU Wen-Zhang Lecturer at the School of Artificial Intelligence, Anhui University. He received his Ph.D. degree in control science and engineering from Southeast University in 2022. His research interests include deep reinforcement learning, embodied intelligent systems, multi-agent reinforcement learning, and transfer reinforcement learning. E-mail: wzliu@ahu.edu.cn

    LU Jian-Hua Master's student at the School of Artificial Intelligence, Anhui University. His research interests include multi-agent reinforcement learning and multi-UAV cooperative control. E-mail: jhlu@stu.ahu.edu.cn

    REN Lu Associate professor at the School of Artificial Intelligence, Anhui University. She received her Ph.D. degree in control science and engineering from Southeast University in 2021. Her research interests include distributed cooperative control of autonomous unmanned systems, deep reinforcement learning, and multi-agent reinforcement learning. E-mail: penny_lu@ahu.edu.cn

    SUN Chang-Yin Professor at the School of Artificial Intelligence, Anhui University. He received his Ph.D. degree in electrical engineering from Southeast University in 2004. His research interests include intelligent control, flight control, pattern recognition, and optimization theory. Corresponding author of this paper. E-mail: cysun@seu.edu.cn

  • Abstract: Multi-UAV cooperative navigation is a key enabling technology for efficient cooperative maritime operations. In vast and dynamically unknown sea areas, however, limited sensing capability and autonomous decision-making mechanisms make the cooperative relationships among UAVs complex, and global information is difficult to obtain. In recent years, multi-agent reinforcement learning under the centralized-training-with-decentralized-execution paradigm has made remarkable progress in learning cooperative behaviors and has been widely applied to maritime cooperative navigation tasks. Yet because interactions between agents often arise only in specific situations, effectively improving cooperation efficiency and exploration capability remains a key challenge. To address this, a multi-agent proximal policy optimization method based on causal influence detection (CID-MAPPO) is proposed. Taking the causal influence between agents as its measuring criterion, the method introduces an intrinsic reward mechanism designed from cooperation rules and applies causal inference and conditional mutual information to detect the causal influence between agents' behaviors, thereby guiding agents to preferentially explore actions that have a positive influence on the global state and strengthening cooperation among the agents. Experimental results show that the proposed method delivers significant performance gains, with particularly high cooperation efficiency in maritime search-and-rescue tasks, verifying its effectiveness.
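As a concrete illustration of the mechanism described above, the following sketch scores the causal influence $C^i(s_t, s_{t+1})$ of UAV $i$'s action as the KL divergence between a learned transition model's prediction under the full joint action and the same prediction with $a_t^i$ marginalized out by Monte Carlo sampling, a pointwise form of conditional mutual information. The interfaces (`transition_model`, `policy_i`) and the Gaussian moment matching are our assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch (illustrative assumptions, not the paper's code) of a
# causal influence score C^i(s_t, s_{t+1}) for agent i: compare the
# predicted next-state distribution given the full joint action with the
# prediction after marginalizing out agent i's action.
import torch
import torch.distributions as td

def causal_influence(transition_model, policy_i, s_t, joint_action, i, K=128):
    """transition_model(s, a) -> td.Normal over s_{t+1} (assumed learned);
    policy_i(s) -> distribution over agent i's actions; K Monte Carlo
    samples, matching the sampling count in Table 2."""
    # Prediction conditioned on the full joint action a_t.
    p_full = transition_model(s_t, joint_action)

    # Marginalize a_t^i: resample agent i's action K times from its policy
    # and moment-match the resulting mixture of Gaussians.
    mix_mean, mix_sq = 0.0, 0.0
    for _ in range(K):
        a_cf = joint_action.clone()        # joint_action: [N, act_dim]
        a_cf[i] = policy_i(s_t).sample()   # counterfactual action for agent i
        p_cf = transition_model(s_t, a_cf)
        mix_mean = mix_mean + p_cf.mean / K
        mix_sq = mix_sq + (p_cf.variance + p_cf.mean ** 2) / K
    mix_var = (mix_sq - mix_mean ** 2).clamp_min(1e-6)
    p_marg = td.Normal(mix_mean, mix_var.sqrt())

    # Pointwise conditional-mutual-information score: D_KL(p_full || p_marg).
    return td.kl_divergence(p_full, p_marg).sum(-1)
```

In this framing, a large score means UAV $i$'s chosen action measurably shifts the predicted global next state; rewarding such steps is what drives the prioritized exploration described in the abstract.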
  • Fig. 1  Diagram of multi-UAV cooperative navigation

    Fig. 2  The body-fixed and inertial coordinate systems of a UAV

    Fig. 3  The causal graphical model of a one-step state transition with three agents

    Fig. 4  The causal graphical model without considering the action of agent $i$

    Fig. 5  Diagram of the $\beta$-VAE framework

    Fig. 6  Diagram of CID-MAPPO

    Fig. 7  Diagram of the experimental scenes

    Fig. 8  Learning curves of different algorithms under various scenarios

    Fig. 9  Diagram of trajectories in some experimental scenes

    Fig. 10  Results of the ablation study on hyper-parameters

    Table 1  Explanation of mathematical symbols

    Symbol  Description
    $ \mathcal{I}=\{1, 2, \cdots, N\} $  Set of UAVs
    $ s_t $  System state at time $ t $
    $ o_t^i $  Local observation of UAV $ i $ at time $ t $
    $ a_t^i $  Action of UAV $ i $ at time $ t $
    $ \pi^i $  Policy of UAV $ i $
    $ \boldsymbol{a}_t=[a_t^1, \cdots, a_t^N] $  Joint action
    $ \boldsymbol{a}_t \setminus a_t^i $  Joint action with UAV $ i $'s action removed
    $ P(s_{t+1} | s_t, \boldsymbol{a}_t) $  State transition probability
    $ \gamma $  Discount factor
    $ C^i(s_t, s_{t+1}) $  Causal influence measure of UAV $ i $
    $ D_\mathrm{KL}(\cdot \| \cdot) $  KL divergence
    $ \beta $  $ \beta $-VAE regularization coefficient
    $ r^i_t $  Reward of UAV $ i $ at time $ t $
    $ r^i_{\mathrm{cid}, t} $  Causal intrinsic reward of UAV $ i $ at time $ t $
    $ \lambda $  Intrinsic reward weight
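As a usage note on the symbols above, a minimal sketch of how the extrinsic reward $ r^i_t $, the causal intrinsic reward $ r^i_{\mathrm{cid}, t} $, and the weight $ \lambda $ might combine during training follows; treating the influence score directly as the intrinsic reward is our simplifying assumption, not the paper's exact rule.

```python
# Illustrative sketch (our assumption, not the paper's exact rule) of how
# the Table 1 quantities combine: the training reward for UAV i adds the
# causal intrinsic term r^i_{cid,t}, weighted by lambda (1.0 in Table 2).
# Here the influence score C^i(s_t, s_{t+1}) is used directly as r^i_{cid,t}.
def shaped_rewards(r_ext, influence, lam=1.0):
    """Per-UAV training rewards: r^i_t + lam * r^i_{cid,t}."""
    return [r + lam * c for r, c in zip(r_ext, influence)]

# Example with three UAVs at a single time step.
print(shaped_rewards(r_ext=[-1.0, -0.5, 0.25], influence=[0.25, 0.5, 0.0]))
# -> [-0.75, 0.0, 0.25]
```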

    Table 2  Experimental hyper-parameter settings

    Parameter  Value
    Learning rate $ \alpha $  0.0005
    VAE module learning rate $ \alpha_\mathrm{vae} $  0.0005
    Discount factor $ \gamma $  0.99
    Clipping coefficient $ \varepsilon $  0.2
    Activation function  ReLU
    Batch size  1024
    Replay buffer size  3200
    Reward weight $ \lambda $  1.0
    VAE parameter $ \beta $  0
    Monte Carlo sample count $ K $  128
    Actor network fully connected layer sizes  [64, 64, 64]
    Actor network RNN hidden layer size  64
    Critic network fully connected layer sizes  [64, 64]
    Critic network RNN hidden layer size  64
    $ \beta $-VAE encoder hidden layer sizes (state encoding)  [256, 128, 64]
    $ \beta $-VAE encoder hidden layer sizes (action encoding)  [64, 128]
    $ \beta $-VAE decoder hidden layer sizes  [64, 128, 256]
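For readers unfamiliar with the $\beta$-VAE module in Fig. 5, a minimal sketch in the standard form of Higgins et al. follows, using the state-encoder and decoder widths listed in Table 2. The action-encoding branch ([64, 128]) is omitted and all variable names are ours, so this illustrates the objective rather than the authors' implementation; note that with the reported $\beta = 0$ the KL term vanishes and the objective reduces to pure reconstruction.

```python
# Sketch of a standard beta-VAE objective (Higgins et al.): reconstruction
# loss plus a beta-weighted KL term to the unit-Gaussian prior. Layer widths
# follow Table 2's state encoder/decoder; everything else is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaVAE(nn.Module):
    def __init__(self, state_dim, latent_dim, beta=0.0):
        super().__init__()
        self.beta = beta  # Table 2 reports beta = 0 in the experiments
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.mu = nn.Linear(64, latent_dim)
        self.logvar = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, state_dim),
        )

    def loss(self, s):
        h = self.encoder(s)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.decoder(z)
        rec_loss = F.mse_loss(recon, s)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return rec_loss + self.beta * kl  # beta-VAE objective
```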

    Table 3  Statistical results of cumulative rewards for different algorithms under various experimental configurations

    Scenario  CID-MAPPO  MAPPO  VDN  QMIX  MADDPG  PMIC
    Scenario 1, 3 UAVs  −131.89 ± 22.66  −155.67 ± 15.71  −225.21 ± 39.60  −194.05 ± 5.96  −225.02 ± 9.64  −147.94 ± 17.65
    Scenario 1, 5 UAVs  −342.43 ± 26.52  −377.48 ± 16.83  −442.86 ± 11.65  −548.82 ± 116.16  −425.96 ± 6.70  −371.49 ± 17.98
    Scenario 1, 7 UAVs  −590.88 ± 43.33  −616.58 ± 35.03  −673.14 ± 211.54  −916.50 ± 239.70  −673.55 ± 28.43  −610.09 ± 26.67
    Scenario 2, 3 UAVs  89.79 ± 10.45  65.35 ± 12.62  −43.16 ± 16.48  7.42 ± 32.91  −75.18 ± 29.76  74.46 ± 18.73
    Scenario 2, 5 UAVs  156.93 ± 16.15  128.10 ± 15.49  −117.25 ± 23.16  −49.41 ± 29.84  −120.12 ± 33.18  125.02 ± 21.03
    Scenario 2, 7 UAVs  259.88 ± 41.40  194.71 ± 23.77  −176.29 ± 26.21  −180.94 ± 31.54  −173.93 ± 24.84  228.79 ± 16.97
    Scenario 3, 3 UAVs  149.63 ± 12.43  71.39 ± 63.32  −73.31 ± 14.59  −55.00 ± 21.38  −106.13 ± 33.02  89.83 ± 48.67
    Scenario 3, 5 UAVs  279.15 ± 11.32  232.04 ± 23.27  −70.00 ± 29.73  55.69 ± 35.57  −141.55 ± 14.94  209.06 ± 28.05
    Scenario 3, 7 UAVs  332.24 ± 12.04  92.45 ± 41.34  −58.83 ± 35.12  −276.00 ± 76.42  −197.56 ± 27.56  115.33 ± 34.52
Publication History
  • Received: 2025-11-30
  • Accepted: 2026-03-16
  • Published online: 2026-04-22
