Adaptive Manifold Constrained Reinforcement Learning for Dynamic Balance Between Safety and Performance

LIU Zheng-Tang^{1, 2, 3}, WU Yue^{1, 2, 3}, HU Chun-He^{1, 2, 3}, LU Ren-Zhi^4, CHENG Cheng^4, ZHENG Yi-Li^{1, 2, 3}

1. School of Technology, Beijing Forestry University, Beijing 100083
2. State Key Laboratory of Efficient Production of Forest Resources, Beijing 100083
3. Key Laboratory of National Forestry and Grassland Administration on Forestry Equipment and Automation, Beijing 100083
4. School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074

doi: 10.16383/j.aas.c260122
cstr: 32138.14.j.aas.c260122

Funds: Supported by National Natural Science Foundation of China (62273053)

Author Bio:
LIU Zheng-Tang: Master's student at the School of Technology, Beijing Forestry University. Her main research interests are safe reinforcement learning and multi-robot cooperative control. E-mail: liuzhengtang@bjfu.edu.cn

WU Yue: Associate Professor at the School of Technology, Beijing Forestry University. His main research interests are multi-robot cooperative control and reinforcement learning. Corresponding author of this paper. E-mail: wuyue_a@bjfu.edu.cn

HU Chun-He: Associate Professor at the School of Technology, Beijing Forestry University. His main research interests include safe reinforcement learning for robots, autonomous robot control, and related fields. E-mail: huchunhe@bjfu.edu.cn

LU Ren-Zhi: Associate Professor at the School of Artificial Intelligence and Automation, Huazhong University of Science and Technology. His main research interests include reinforcement learning and unmanned systems. E-mail: rzlu@hust.edu.cn

CHENG Cheng: Associate Professor at the School of Artificial Intelligence and Automation, Huazhong University of Science and Technology. Her main research interests are system identification and intelligent mixed-line assembly. E-mail: c_cheng@hust.edu.cn

ZHENG Yi-Li: Professor at the School of Technology, Beijing Forestry University. His main research interests are smart forestry monitoring and robot control. E-mail: zhengyili@bjfu.edu.cn

Publication history
- Received: 2026-02-10
- Accepted: 2026-05-07
- Published online: 2026-07-02

Abstract: As robots move into real-world settings, strictly satisfying safety constraints while preserving exploration efficiency during learning has become a key obstacle to the practical use of intelligent robotic systems. Existing safe learning methods generally rely on fixed control gains, which makes it difficult to balance exploration efficiency against constraint satisfaction during training and can lead to either overly conservative behavior or insufficient constraint maintenance. To address this issue, this paper proposes a state-dependent gain-scheduled adaptive ATACOM method for dynamic safety-performance balancing, termed SGA-ATACOM. The method keeps the original ATACOM structure (tangent-space exploration on the constraint manifold, analytical constraint enforcement, and error-correction control) and adds a two-layer online scheduling mechanism consisting of a constraint layer and a controller layer. The constraint layer adaptively adjusts the viability-constraint parameters according to the current constraint state, its trend, and short-term predictions. The controller layer updates the error-correction gain and the velocity-feasibility gain online from the constraint residual and velocity-risk indicators. As a result, constraint recovery and velocity protection are strengthened when the risk indicators are high, and more exploration space is released when they are low. Simulation experiments are conducted on the CircularMotion task and the Planar Air Hockey manipulator task with different reinforcement learning algorithms. The results show that the method generalizes across task scenarios and reinforcement learning algorithms, and achieves a better safety-performance trade-off while keeping constraint violations low. Ablation studies further verify the complementary roles of constraint-layer scheduling and controller-layer scheduling. Overall, the proposed method provides a geometrically interpretable and structure-preserving improvement for coordinated safety-performance optimization in constrained reinforcement learning.
- Safe reinforcement learning
- constraint manifold
- ATACOM
- state-dependent gain scheduling
- safety-performance balance
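The abstract describes the controller layer only qualitatively: gains are raised when risk is high and relaxed when risk is low. The sketch below shows one way such a state-dependent schedule could look. It is not the paper's update rule. The function name `schedule_gain`, the normalized risk indicator in [0, 1], the threshold test, and the first-order update are all illustrative assumptions. The default numbers echo the CircularMotion controller-layer column of Table 1, but how the paper actually uses those hyperparameters is not stated on this page, so that mapping is a guess as well.

```python
def schedule_gain(gain, base_gain, risk, *, max_scale=2.0, trigger=0.85,
                  rise_rate=0.15, decay_ratio=0.35):
    """Return the next value of one controller gain (illustrative rule only).

    `risk` is an assumed normalized risk indicator in [0, 1]; the paper's
    actual indicators (constraint residual, velocity risk) are not reproduced.
    """
    risk = min(max(risk, 0.0), 1.0)           # clamp the assumed risk indicator
    upper = base_gain * max_scale             # assumed upper bound on the gain
    if risk >= trigger:
        target, rate = upper, rise_rate       # high risk: tighten quickly
    else:
        target, rate = base_gain, rise_rate * decay_ratio  # low risk: relax slowly
    new_gain = gain + rate * (target - gain)  # first-order move toward the target
    return min(max(new_gain, base_gain), upper)


# Example: an error-correction gain with baseline 100 during a short risk spike.
k_c = 100.0
for risk in (0.2, 0.9, 0.95, 0.3, 0.1):
    k_c = schedule_gain(k_c, 100.0, risk)
    print(f"risk={risk:.2f} -> K_c={k_c:.2f}")
```

The asymmetric rates are a design choice of this sketch: reacting faster to rising risk than to falling risk keeps the gain from oscillating when the state hovers near the threshold.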

Fig. 1 The overall framework of SGA-ATACOM

Fig. 2 Schematic diagram of the simulation environments

Fig. 3 Performance and constraint comparison of different methods in the CircularMotion environment

Fig. 4 Training curves of SGA-ATACOM with different reinforcement learning algorithms in the CircularMotion environment

Fig. 5 Performance and constraint comparison of different methods in the Planar Air Hockey environment

Fig. 6 Training curves of SGA-ATACOM with different reinforcement learning algorithms in the Planar Air Hockey environment

Fig. 7 Ablation results of SGA-ATACOM in the CircularMotion environment

Fig. 8 Gain evolution curves during training

Fig. 9 State-parameter correspondence curves of the constraint layer

Fig. 10 State-parameter correspondence curves of the controller layer

Table 1 Key method hyperparameters of SGA-ATACOM

Module	Parameter	CircularMotion	Planar Air Hockey
Constraint layer	Baseline $ K_f/K_g $	$ 0.1 / 2.0 $	$ 0.5 / 1.0 $
	Adjustment range $ [k_{\min},\; k_{\max}] $	$ [0.35,\; 1.0] / [0.40,\; 1.0] $	$ [0.45,\; 1.0] / [0.50,\; 1.0] $
	adaptation_rate	$ 0.25 / 0.30 $	$ 0.18 / 0.16 $
	danger_gain / relax_gain	$ 4.0,\; 0.05 / 3.0,\; 0.05 $	$ 2.5,\; 0.04 / 2.0,\; 0.05 $
Controller layer	Baseline $ K_c/K_q $	$ 100 / 20 $	$ 240 / (2a_{\max}/v_{\max}) $
	Maximum scaling factor	$ 2.0 / 2.0 $	$ 1.8 / 1.8 $
	adaptation_rate / decay_ratio	$ 0.15,\; 0.35 / 0.20,\; 0.35 $	$ 0.12,\; 0.35 / 0.16,\; 0.30 $
	trigger / gain / EMA	$ 0.85,\; 1.5,\; 0.20 / 0.18 $	$ 0.88,\; 1.2,\; 0.18 / 0.18 $
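For readers who want to carry the constraint-layer rows of Table 1 into code, the following sketch stores them as a small configuration object. Several things here are assumptions rather than facts from the paper: the class name `ConstraintLayerConfig`, the reading of each slash-separated pair as "value for $K_f$ / value for $K_g$", and the interpretation of $[k_{\min}, k_{\max}]$ as a multiplicative scale applied to the baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstraintLayerConfig:
    """Constraint-layer hyperparameters as read from Table 1 (pairing assumed)."""
    kf_base: float
    kg_base: float
    kf_range: tuple  # assumed (k_min, k_max) scale bounds for K_f
    kg_range: tuple  # assumed (k_min, k_max) scale bounds for K_g


CIRCULAR_MOTION = ConstraintLayerConfig(
    kf_base=0.1, kg_base=2.0, kf_range=(0.35, 1.0), kg_range=(0.40, 1.0))
PLANAR_AIR_HOCKEY = ConstraintLayerConfig(
    kf_base=0.5, kg_base=1.0, kf_range=(0.45, 1.0), kg_range=(0.50, 1.0))


def scaled_parameter(base, scale, k_range):
    """Clamp a requested scale into [k_min, k_max] and apply it to the baseline.

    Treating the range as a multiplier on the baseline is an assumption.
    """
    k_min, k_max = k_range
    return base * min(max(scale, k_min), k_max)


# A requested scale below k_min is clipped, so K_f never drops under 0.35 * 0.1.
print(scaled_parameter(CIRCULAR_MOTION.kf_base, 0.2, CIRCULAR_MOTION.kf_range))
```

Under this reading, $k_{\max} = 1.0$ means the scheduler can only shrink a parameter relative to its baseline, never enlarge it.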

Table 2 Quantitative results of different methods in the CircularMotion environment

Algorithm	Method	$ R $	$ c_{\mathrm{avg}} $	$ c_{\max} $	$ c_{\dot{q},\;\max} $
DDPG	ATACOM	$ 235.7015 $	$ 0.006402 $	$ 0.031223 $	$ 0 $
	SGA-ATACOM	$ 247.1787 $	$ 0.003576 $	$ 0.012514 $	$ 0 $
	Error correction	$ 74.95398 $	$ 0.031649 $	$ 0.140175 $	$ 0 $
	Termination mechanism	$ -95.7393 $	$ 0.118951 $	$ 0.428035 $	$ 0.498038 $
PPO	ATACOM	$ 278.9282 $	$ 0.006523 $	$ 0.031222 $	$ 0 $
	SGA-ATACOM	$ 300.6381 $	$ 0.002561 $	$ 0.012518 $	$ 0 $
	Error correction	$ 192.1773 $	$ 0.028069 $	$ 0.139112 $	$ 0 $
	Termination mechanism	$ 71.2089 $	$ 0.101304 $	$ 0.401823 $	$ 0.046248 $
TRPO	ATACOM	$ 264.6153 $	$ 0.006437 $	$ 0.031204 $	$ 0 $
	SGA-ATACOM	$ 293.5804 $	$ 0.003045 $	$ 0.012515 $	$ 0 $
	Error correction	$ 109.1742 $	$ 0.039597 $	$ 0.135171 $	$ 0 $
	Termination mechanism	$ 62.17836 $	$ 0.180728 $	$ 0.405737 $	$ 0.313938 $
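To make the ATACOM versus SGA-ATACOM rows of Table 2 easier to compare, the short script below converts them into percentage changes. The numbers are copied from the table. The dictionary layout and the helper `percent_change` are illustrative choices, not part of the paper.

```python
# (R, c_avg) pairs copied from Table 2 for the two manifold-based methods.
TABLE2 = {
    "DDPG": {"ATACOM": (235.7015, 0.006402), "SGA-ATACOM": (247.1787, 0.003576)},
    "PPO": {"ATACOM": (278.9282, 0.006523), "SGA-ATACOM": (300.6381, 0.002561)},
    "TRPO": {"ATACOM": (264.6153, 0.006437), "SGA-ATACOM": (293.5804, 0.003045)},
}


def percent_change(new, old):
    """Signed relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old


summary = {}
for algo, rows in TABLE2.items():
    (r_base, c_base), (r_sga, c_sga) = rows["ATACOM"], rows["SGA-ATACOM"]
    summary[algo] = (percent_change(r_sga, r_base), percent_change(c_sga, c_base))
    print(f"{algo}: return {summary[algo][0]:+.1f}%, c_avg {summary[algo][1]:+.1f}%")
```

By this measure the return rises by roughly 5% to 11% across the three algorithms, while the average constraint violation falls by roughly 44% to 61%.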

Table 3 Quantitative results of different methods in the Planar Air Hockey environment

Algorithm	Method	$ R $	$ c_{\mathrm{avg}} $	$ c_{\max} $	$ c_{\dot{q},\;\max} $
PPO	ATACOM	$ 440.1124 $	$ 0 $	$ 0 $	$ 0 $
	SGA-ATACOM	$ 525.2286 $	$ 0 $	$ 0 $	$ 0 $
	Unconstrained method	$ 280.524 $	$ 0.000182 $	$ 0.033132 $	$ 1.178097 $
TRPO	ATACOM	$ 306.4537 $	$ 0 $	$ 0 $	$ 0 $
	SGA-ATACOM	$ 438.2295 $	$ 0 $	$ 0 $	$ 0.055845 $
	Unconstrained method	$ 215.1846 $	$ 0.019395 $	$ 0.37453 $	$ 1.178097 $
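Table 3 mixes one "higher is better" column (return) with three "lower is better" violation columns, so a Pareto-dominance check is a convenient way to read it. The sketch below is an analysis aid, not something the paper reports. The `Row` type, the short label `"Unconstrained"`, and the `dominates` helper are hypothetical names introduced here, and the values are copied from the table.

```python
from typing import NamedTuple


class Row(NamedTuple):
    ret: float        # R, higher is better
    c_avg: float      # violation metrics, lower is better
    c_max: float
    c_vel_max: float


TABLE3 = {
    ("PPO", "ATACOM"): Row(440.1124, 0.0, 0.0, 0.0),
    ("PPO", "SGA-ATACOM"): Row(525.2286, 0.0, 0.0, 0.0),
    ("PPO", "Unconstrained"): Row(280.524, 0.000182, 0.033132, 1.178097),
    ("TRPO", "ATACOM"): Row(306.4537, 0.0, 0.0, 0.0),
    ("TRPO", "SGA-ATACOM"): Row(438.2295, 0.0, 0.0, 0.055845),
    ("TRPO", "Unconstrained"): Row(215.1846, 0.019395, 0.37453, 1.178097),
}


def dominates(a: Row, b: Row) -> bool:
    """True if `a` is no worse than `b` on every metric and better on at least one."""
    no_worse = a.ret >= b.ret and all(x <= y for x, y in zip(a[1:], b[1:]))
    better = a.ret > b.ret or any(x < y for x, y in zip(a[1:], b[1:]))
    return no_worse and better


for algo in ("PPO", "TRPO"):
    sga, base = TABLE3[(algo, "SGA-ATACOM")], TABLE3[(algo, "ATACOM")]
    print(algo, "SGA-ATACOM dominates ATACOM:", dominates(sga, base))
```

With PPO, SGA-ATACOM dominates ATACOM outright. With TRPO neither dominates the other: SGA-ATACOM has the higher return but also a nonzero maximum velocity violation (0.055845), so that row is a trade-off rather than a strict improvement.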

Table 4 Quantitative results of the ablation experiments in the CircularMotion environment

Method	$ R $	$ c_{\mathrm{avg}} $	$ c_{\max} $	$ c_{\dot{q},\;\max} $
ATACOM	$ 264.6153 $	$ 0.006437 $	$ 0.031204 $	$ 0 $
Controller-layer scheduling only	$ 254.0843 $	$ 0.004486 $	$ 0.024616 $	$ 0 $
Constraint-layer scheduling only	$ 235.9403 $	$ 0.004151 $	$ 0.015066 $	$ 0 $
SGA-ATACOM	$ 293.5804 $	$ 0.003045 $	$ 0.012515 $	$ 0 $
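One way to see the complementarity claimed for the two scheduling layers is to rank the four ablation variants on each metric. The script below does that with the values from Table 4. The short labels `"Controller-layer only"` and `"Constraint-layer only"` and the helper `ranking` are names chosen for this sketch.

```python
# R, c_avg and c_max copied from Table 4 (the velocity violation is 0 in every row).
ABLATION = {
    "ATACOM": {"R": 264.6153, "c_avg": 0.006437, "c_max": 0.031204},
    "Controller-layer only": {"R": 254.0843, "c_avg": 0.004486, "c_max": 0.024616},
    "Constraint-layer only": {"R": 235.9403, "c_avg": 0.004151, "c_max": 0.015066},
    "SGA-ATACOM": {"R": 293.5804, "c_avg": 0.003045, "c_max": 0.012515},
}


def ranking(metric, higher_is_better):
    """Method names sorted from best to worst on the given metric."""
    return sorted(ABLATION, key=lambda name: ABLATION[name][metric],
                  reverse=higher_is_better)


print("Return:", ranking("R", higher_is_better=True))
print("c_avg: ", ranking("c_avg", higher_is_better=False))
print("c_max: ", ranking("c_max", higher_is_better=False))
```

Each single-layer variant lowers both violation metrics relative to ATACOM but ends with a lower return than ATACOM, whereas the full SGA-ATACOM ranks first on all three metrics. That pattern is consistent with the two layers being complementary rather than redundant.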
