Abstract: This paper presents a deep reinforcement learning control method for spacecraft formation. By introducing a dynamical reward, the method accounts for the dynamical feasibility of the trajectory and optimizes fuel consumption. A relative-motion dynamics model with $J_{2}$ perturbation is introduced into the training environment and, based on the proximal policy optimization algorithm, the local observations of each spacecraft are used as the inputs of the Actor and Critic networks. The Actor network outputs the desired position and velocity of the spacecraft; combined with the dynamics model, which restricts the control transitions between arbitrary actions of the policy, the output trajectory therefore accounts for dynamical feasibility. The Critic network estimates, from the local observations, the advantage function constrained by the dynamics model, and this estimate guides the parameter updates of the Actor network. Furthermore, the dynamical reward is defined as the negative of the fuel consumption; combined with collision-avoidance and task-related rewards, the trained Actor network completes the distributed spacecraft formation task while optimizing fuel consumption.
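As a concrete illustration of the reward structure described above, the following minimal Python sketch combines a dynamical reward (the negative of the per-step fuel consumption) with collision penalties and a task-related reward. The function names and the exact weighting scheme, using the coefficients $\alpha_1$ and $\alpha_2$ and the penalties $r_{\text{b}}$ and $r_{\text{inter}}$ listed in Table 1, are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical per-step reward composition for one spacecraft, assuming the
# weighting scheme suggested by Table 1 (alpha_1 for the task reward,
# alpha_2 for the dynamical reward) and the collision penalties r_b, r_inter.
R_BOUNDARY = -500.0   # reward for colliding with the boundary (r_b)
R_INTER = -10.0       # reward for an inter-spacecraft collision (r_inter)
ALPHA_1 = 0.6         # weight of the task-related (expectation) reward
ALPHA_2 = 0.5         # weight of the dynamical reward (varied in Table 2)


def dynamical_reward(u: np.ndarray, dt: float) -> float:
    """Negative of the fuel consumed over one step, approximated by the
    magnitude of the control acceleration times the step length (delta-v)."""
    return -float(np.linalg.norm(u)) * dt


def step_reward(u: np.ndarray, dt: float, task_reward: float,
                hit_boundary: bool, hit_neighbor: bool) -> float:
    """Combine task, collision, and dynamical rewards into one scalar signal."""
    r = ALPHA_1 * task_reward + ALPHA_2 * dynamical_reward(u, dt)
    if hit_boundary:
        r += R_BOUNDARY
    if hit_neighbor:
        r += R_INTER
    return r
```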
Key words:
- Deep reinforcement learning
- spacecraft formation
- distributed control
- dynamics
Table 1 Values of some hyperparameters

| Symbol | Meaning | Value |
|---|---|---|
| $p$ | Number of parallel training environments | 128 |
| $\sigma$ | Weighting coefficient of the policy entropy | $\{3\times10^{-3},\;3\times10^{-5}\}$ |
| $\varepsilon,\;\delta$ | Clipping hyperparameters | 0.2 |
| $u_{\min}$ | Lower bound of the control acceleration | $-0.001$ m/s$^2$ |
| $u_{\max}$ | Upper bound of the control acceleration | $0.001$ m/s$^2$ |
| $\gamma$ | Discount factor | 0.99 |
| $\dim\left({\boldsymbol{l}}\right)$ | Number of low-power radar detection data | 90 |
| $r_{\text{b}}$ | Reward for colliding with the boundary | $-500$ |
| $r_{\text{inter}}$ | Reward for inter-spacecraft collision | $-10$ |
| $r_{\text{base}},\;r_{\text{inc}}$ | Hyperparameters of the expectation reward | 1, 35 |
| $\alpha_{1}$ | Weighting coefficient of the expectation reward | 0.6 |
| $l_{r}$ | Learning rate | $\{2\times10^{-4},\;0\}$ |
| $a$ | Semi-major axis | 7100 km |
| $\omega$ | Argument of perigee | $-20^{\circ}$ |
| $f$ | True anomaly | $20^{\circ}$ |
| $\Omega$ | Longitude of the ascending node | $0^{\circ}$ |
| $e$ | Eccentricity | 0.05 |
| $i_{o}$ | Orbital inclination | $15^{\circ}$ |
| $k_{p},\;k_{i},\;k_{d}$ | PID tracker parameters | 1.0, 0.01, 2.0 |

Table 2 Results under different percentages of the dynamical reward after 500 epochs

| $\alpha_{2}$ | Completion rate (%) | Average completion steps | Fuel consumption rate (%) | Absolute fuel consumption |
|---|---|---|---|---|
| - | 91.5 | 142.95 | - | - |
| 0 | 84.5 | 123.95 | 61.21 | 75.88 |
| 0.5 | 69.5 | 125.42 | 58.17 | 72.96 |
| 1.0 | 64.0 | 127.79 | 46.83 | 59.84 |
| 1.5 | 0 | - | - | - |
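The PID tracker gains $k_p$, $k_i$, $k_d$ and the control-acceleration bounds $u_{\min}$, $u_{\max}$ in Table 1 suggest how the desired position and velocity output by the Actor network could be turned into a bounded control command. The sketch below is a minimal illustration under that assumption; the class name and interface are hypothetical, not the paper's implementation.

```python
import numpy as np

class PIDTracker:
    """Track the desired relative position/velocity output by the Actor network
    with a PID law, saturating the command to the acceleration bounds of Table 1."""

    def __init__(self, kp=1.0, ki=0.01, kd=2.0,
                 u_min=-0.001, u_max=0.001):  # gains and bounds from Table 1
        self.kp, self.ki, self.kd = kp, ki, kd
        self.u_min, self.u_max = u_min, u_max
        self.integral = np.zeros(3)

    def control(self, pos, vel, pos_des, vel_des, dt):
        """Return the saturated control acceleration (m/s^2) for one step."""
        e = pos_des - pos              # position error (proportional term)
        e_dot = vel_des - vel          # velocity error (derivative term)
        self.integral += e * dt        # accumulated error (integral term)
        u = self.kp * e + self.ki * self.integral + self.kd * e_dot
        return np.clip(u, self.u_min, self.u_max)


# Example: one control step toward a desired relative state (meters, m/s)
tracker = PIDTracker()
u = tracker.control(pos=np.zeros(3), vel=np.zeros(3),
                    pos_des=np.array([10.0, 0.0, 0.0]),
                    vel_des=np.zeros(3), dt=1.0)
```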