Multi-USVs Cooperative Policy Optimization Method Based on Disturbed Input of Advantage Function

REN Lu, KE Ya-Nan, LIU Wen-Zhang, MU Chao-Xu, SUN Chang-Yin

Citation: Ren Lu, Ke Ya-Nan, Liu Wen-Zhang, Mu Chao-Xu, Sun Chang-Yin. Multi-USVs cooperative policy optimization method based on disturbed input of advantage function. Acta Automatica Sinica, xxxx, xx(x): x−xx doi: 10.16383/j.aas.c240453

doi: 10.16383/j.aas.c240453 cstr: 32138.14.j.aas.c240453

Funds: Supported by National Natural Science Foundation of China (62303009)
More Information
    Author Bio:

    REN Lu Lecturer at the School of Artificial Intelligence, Anhui University. She received her Ph.D. degree in control science and engineering from Southeast University in 2021. Her research interest covers multi-agent system consensus control, deep reinforcement learning, and multi-agent reinforcement learning. E-mail: penny_lu@ia.ac.cn

    KE Ya-Nan Master student at the School of Artificial Intelligence, Anhui University. Her research interest covers multi-agent reinforcement learning and ship motion control. E-mail: yanan_ke@stu.ahu.edu.cn

    LIU Wen-Zhang Lecturer at the School of Artificial Intelligence, Anhui University. He received his Ph.D. degree in control science and engineering from Southeast University in 2021. His research interest covers deep reinforcement learning, multi-agent reinforcement learning, transfer reinforcement learning, and robotics. Corresponding author of this paper. E-mail: wzliu@ahu.edu.cn

    MU Chao-Xu Professor at the School of Electrical Automation and Information Engineering, Tianjin University. Her research interest covers reinforcement learning, adaptive learning systems, and intelligent control and optimization. E-mail: cxmu@tju.edu.cn

    SUN Chang-Yin Professor at the School of Automation, Anhui University. He received his bachelor's degree in applied mathematics from Sichuan University in 1996, and his master's and Ph.D. degrees in electronic engineering from Southeast University in 2001 and 2004, respectively. His research interest covers intelligent control, aircraft control, pattern recognition, and optimization theory. E-mail: 20168@ahu.edu.cn

  • Abstract: Cooperative navigation of multiple unmanned surface vehicles (USVs) is essential for efficient maritime operations, yet handling the complex cooperative relationships among vessels and achieving autonomous cooperative decision-making in open, unknown waters remains a pressing challenge. In recent years, multi-agent reinforcement learning (MARL) has shown great potential for solving complex multi-agent decision-making problems and has been widely applied to multi-USV cooperative navigation tasks. However, such data-driven methods typically suffer from low exploration efficiency, difficulty in balancing exploration and exploitation, and a tendency to fall into local optima. Therefore, building on the centralized training and decentralized execution (CTDE) framework, this work injects a perturbation into the input of the advantage function to improve its generalization ability, and proposes a new noise-advantage multi-agent proximal policy optimization (NA-MAPPO) method to improve the exploration efficiency of the multi-USV cooperative policy. Experimental results show that, compared with existing baseline algorithms, the proposed method effectively increases the success rate of multi-USV cooperative navigation, and shortens both policy training time and task completion time, thereby improving cooperative exploration efficiency and preventing the policy from falling into local optima.
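To make the mechanism summarized in the abstract concrete, the sketch below shows one way an advantage-input perturbation could look inside a MAPPO-style clipped actor update. This is a minimal illustration under stated assumptions: the function name, the Gaussian (white) noise model, and the noise scale `sigma` are hypothetical and are not taken from the authors' implementation.

```python
# Minimal sketch (assumed, not the authors' released code): a MAPPO-style clipped
# surrogate loss in which zero-mean Gaussian noise is injected into the advantage
# estimates before they enter the objective, mimicking the "disturbed input of
# advantage function" idea described in the abstract.
import torch


def clipped_actor_loss_with_noisy_advantage(log_probs_new: torch.Tensor,
                                            log_probs_old: torch.Tensor,
                                            advantages: torch.Tensor,
                                            sigma: float = 0.1,
                                            clip_eps: float = 0.2) -> torch.Tensor:
    # Perturb the advantage input (white-noise stand-in; the paper may use a
    # different noise model or scale).
    noisy_adv = advantages + sigma * torch.randn_like(advantages)
    # Advantage normalization, as is common in MAPPO implementations.
    noisy_adv = (noisy_adv - noisy_adv.mean()) / (noisy_adv.std() + 1e-8)

    ratio = torch.exp(log_probs_new - log_probs_old)  # importance sampling ratio
    surr1 = ratio * noisy_adv
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * noisy_adv
    # Minimize the negative of the clipped surrogate to maximize the objective.
    return -torch.min(surr1, surr2).mean()
```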
  • Fig. 1  Diagram of multi-USVs cooperative navigation

    Fig. 2  The body-fixed coordinate system and inertial coordinate system of the USV

    Fig. 3  Diagram of the reward function

    Fig. 4  Diagram of NA-MAPPO; the gray section represents the decentralized execution part, while the blue section represents the centralized training part

    Fig. 5  Diagram of the experience sharing mechanism

    Fig. 6  Diagram of the experimental scenes

    Fig. 7  Learning curves of the algorithms under different scenes

    Fig. 8  Navigation success rates under different scenes

    Fig. 9  Navigation time under different scenes

    Table 2  Comparison of navigation success rate, navigation time, and cumulative episode reward

    Navigation success rate (%)
               MAPPO   WN-MAPPO   OU-MAPPO   NA-MAPPO   NA-WN-MAPPO   NA-OU-MAPPO
    Scene 1     45       75         90         95          98            98
    Scene 2     72       70         88         90          93            95
    Scene 3     40       70         84         88          90            93
    Scene 4     35       60         80         86          90            92

    Navigation time (s)
               MAPPO   WN-MAPPO   OU-MAPPO   NA-MAPPO   NA-WN-MAPPO   NA-OU-MAPPO
    Scene 1     34       30         25         22          20            20
    Scene 2     33       33         32         32          28            22
    Scene 3     42       36         34         32          30            25
    Scene 4     45       38         36         34          34            26

    Cumulative episode reward
               MAPPO    WN-MAPPO   OU-MAPPO   NA-MAPPO   NA-WN-MAPPO   NA-OU-MAPPO
    Scene 1     42.93     45.45      52.50      68.50       70.00         72.60
    Scene 2     70.00     65.80      78.02      90.50      100.00        100.00
    Scene 3    109.05    120.85     136.47     143.33      151.00        154.00
    Scene 4    134.05    143.36     153.09     190.66      200.00        200.00

    Table 3  Experimental hyperparameter settings

    Parameter                                                         Value
    Discount factor $ \gamma $                                        0.9
    Learning rate of the critic network $ {\alpha _w} $               0.001
    Learning rate of the actor network $ {\alpha _u} $                0.0001
    Learning rate of the target critic network $ {\alpha _{w'}} $     0.001
    Batch size $ N $                                                  1024
    Buffer size $ M $                                                 10000
    Soft update factor $ \tau $                                       0.001
    Number of neurons in hidden layer 1                               128
    Number of neurons in hidden layer 2                               128
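As a quick illustration of how the settings in Table 3 might be wired into a training script, the snippet below collects them into a plain Python dictionary. The key names are assumed for readability and are not taken from the authors' code; only the values come from the table.

```python
# Hyperparameters from Table 3, gathered into an illustrative configuration dict.
# Key names are hypothetical; values are the ones reported in the table.
config = {
    "gamma": 0.9,                 # discount factor
    "critic_lr": 1e-3,            # learning rate of the critic network
    "actor_lr": 1e-4,             # learning rate of the actor network
    "target_critic_lr": 1e-3,     # learning rate of the target critic network
    "batch_size": 1024,           # batch size N
    "buffer_size": 10_000,        # buffer size M
    "tau": 1e-3,                  # soft update factor for the target network
    "hidden_sizes": (128, 128),   # neurons in hidden layers 1 and 2
}
```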
Publication history
  • Received: 2024-06-30
  • Accepted: 2024-10-09
  • Published online: 2024-12-04
