Safe Reinforcement Learning: A Survey

WANG Xue-Song; WANG Rong-Rong; CHENG Yu-Hu

doi: 10.16383/j.aas.c220631

1. School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116

Funds: Supported by National Natural Science Foundation of China (62176259, 61976215) and Key Research and Development Program of Jiangsu Province (BE2022095)

Author Bio:
WANG Xue-Song: Professor at China University of Mining and Technology. She received her Ph.D. degree from China University of Mining and Technology in 2002. Her research interests cover machine learning and pattern recognition. E-mail: wangxuesongcumt@163.com

WANG Rong-Rong: Ph.D. candidate at China University of Mining and Technology. She received her master's degree from University of Jinan in 2021. Her main research interest is deep reinforcement learning. E-mail: wangrongrong1996@126.com

CHENG Yu-Hu: Professor at China University of Mining and Technology. He received his Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences in 2005. His research interests cover machine learning and intelligent systems. Corresponding author of this paper. E-mail: chengyuhu@163.com

Publication history
- Received: 2022-08-08
- Accepted: 2023-01-11
- Published online: 2023-03-09
- Issue date: 2023-09-26

Abstract: Reinforcement learning (RL) has achieved great success in the game of Go, video games, navigation, recommendation systems, and other fields. However, many RL algorithms still cannot be transferred directly to real physical environments. In simulation, an agent can interact with the environment by continual trial and error and thereby learn an optimal policy; for safety reasons, however, many real-world applications require that the agent's random exploration be restricted. Safety has therefore become a major challenge in moving reinforcement learning from simulation to reality. In recent years, much research has been devoted to developing safe reinforcement learning (SRL) algorithms that satisfy safety constraints while ensuring system performance. This paper presents a comprehensive survey of existing SRL algorithms and groups them into three categories: modification of the learning process, modification of the learning objective, and offline reinforcement learning. It also introduces five benchmarking platforms: Safety Gym, safe-control-gym, SafeRL-Kit, D4RL, and NeoRL. Finally, it summarizes applications of SRL in autonomous driving, robot control, industrial process control, power system optimization, and healthcare, and gives conclusions and an outlook.
- Safe reinforcement learning (SRL) /
- constrained Markov decision process (CMDP) /
- learning process /
- learning objective /
- offline reinforcement learning

Fig. 1 Methods, benchmarking platforms, and applications of safe reinforcement learning

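Table 1 Comparison of safe reinforcement learning methods

Method category	Subcategory	Safe during training	Safe during deployment	Real-time interaction with environment	Advantages	Disadvantages	Application areas
Modifying the learning process	Environment knowledge	√	√	√	High sample efficiency	Requires a dynamics model of the environment; complex to implement	Autonomous driving [12−13, 23], industrial process control [24−25], power system optimization [26], healthcare [21]
Modifying the learning process	Human knowledge	√	√	√	Speeds up learning	High cost of human supervision	Robot control [14, 27], power system optimization [28], healthcare [29]
Modifying the learning process	No prior knowledge	√	√	√	No prior knowledge required; highly scalable	Poor convergence; unstable training	Autonomous driving [30], robot control [31], industrial process control [32], power system optimization [33], healthcare [34]
Modifying the learning objective	Lagrangian method	×	√	√	Simple idea; easy to implement	Lagrange multipliers are difficult to choose	Industrial process control [15], power system optimization [16]
Modifying the learning objective	Trust region method	√	√	√	Good convergence; stable training	Non-negligible approximation error; low sample efficiency	Robot control [35]
Offline reinforcement learning	Policy constraint	√	×	×	Good convergence	High variance; low sample efficiency	Healthcare [36]
Offline reinforcement learning	Value constraint	√	×	×	Low variance in value-function estimates	Poor convergence	Industrial process control [22]
Offline reinforcement learning	Pre-trained model	√	×	×	Speeds up learning; strong generalization	Complex to implement	Industrial process control [37]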
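Table 1 describes the Lagrangian method as simple to implement but sensitive to the choice of multiplier. The following is a minimal sketch of the basic update, assuming a single cost constraint J_c ≤ d: the policy maximizes the penalized objective J_r − λ(J_c − d), while λ is increased when the constraint is violated and decreased (never below zero) otherwise. The function names, the learning rate, and the cost values are illustrative assumptions and are not taken from any of the surveyed algorithms.

```python
def lagrangian(expected_reward, expected_cost, lam, cost_limit):
    """Penalized objective L = J_r - lam * (J_c - d) maximized by the policy."""
    return expected_reward - lam * (expected_cost - cost_limit)


def update_multiplier(lam, expected_cost, cost_limit, lr=0.1):
    """Projected gradient ascent on the dual variable; lam is kept non-negative."""
    return max(0.0, lam + lr * (expected_cost - cost_limit))


# Hypothetical cost estimates from three successive policy evaluations.
lam = 0.0
history = []
for cost in [0.8, 0.6, 0.2]:
    lam = update_multiplier(lam, cost, cost_limit=0.3)
    history.append(lam)

# The multiplier grows while the cost exceeds the limit and shrinks afterwards.
print(history)
```

Because λ reacts only after a violation has been observed, the sketch is consistent with the table marking this method as not safe during training. The learning rate and initial value of λ determine how quickly violations are penalized, which may be one reason the table lists multiplier selection as a difficulty.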

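Table 2 Comparison of benchmarking platforms for safe reinforcement learning

Benchmarking platform	Task type	Applicable methods	Baseline algorithm type (model use)	Baseline algorithm type (policy type)	Features
Safety Gym	Robot navigation	Modifying the learning process and objective	Model-free	On-policy	Contains multiple high-dimensional continuous control tasks; the most widely used platform for evaluating safe reinforcement learning algorithms
safe-control-gym	Robot control	Modifying the learning process and objective	Model-free and model-based	On-policy and off-policy	Supports model-based methods; allows convenient comparison with control methods
SafeRL-Kit	Autonomous driving	Modifying the learning process and objective	Model-free	Off-policy	First benchmarking platform for off-policy safe reinforcement learning algorithms targeting autonomous driving tasks
D4RL	Robot navigation and control, autonomous driving	Offline reinforcement learning	Model-free	Offline	Collects offline data from multiple environments; has become the standard evaluation platform for offline reinforcement learning algorithms
NeoRL	Robot control, industrial control, stock trading, product promotion	Offline reinforcement learning	Model-free and model-based	Offline	Contains multiple high-dimensional or highly stochastic tasks from real-world application scenarios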

References (123)

[1]	Sutton R S, Barto A G. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 2018.
[2]	Dong S, Wang P, Abbas K. A survey on deep learning and its applications. Computer Science Review, 2021, 40: Article No. 100379 doi: 10.1016/j.cosrev.2021.100379
[3]	Wen Z D, Wang J R, Wang X X, Pan Q. A review of disentangled representation learning. Acta Automatica Sinica, 2022, 48(2): 351-374
[4]	Silver D, Huang A, Maddison C, Guez A, Sifre L, Driessche G, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016, 529(7587): 484-489 doi: 10.1038/nature16961
[5]	Shao K, Tang Z T, Zhu Y H, Li N N, Zhao D B. A survey of deep reinforcement learning in video games. arXiv preprint arXiv: 1912.10944, 2019.
[6]	Kiran B R, Sobh I, Talpaert V, Mannion P, Sallab A A A, Yogamani S, et al. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 2022, 23(6): 4909-4926 doi: 10.1109/TITS.2021.3054625
[7]	Huang Y L, Xu D, Tan M. On imitation learning of robot movement trajectories: A survey. Acta Automatica Sinica, 2022, 48(2): 315-334
[8]	Zhang Z D, Zhang D X, Qiu R C. Deep reinforcement learning for power system applications: An overview. CSEE Journal of Power and Energy Systems, 2020, 6(1): 213-225
[9]	Liu J, Gu Y, Cheng Y H, Wang X S. Prediction of breast cancer pathogenic genes based on multi-agent reinforcement learning. Acta Automatica Sinica, 2022, 48(5): 1246-1258 doi: 10.16383/j.aas.c210583
[10]	García J, Fernández F. A comprehensive survey on safe reinforcement learning. The Journal of Machine Learning Research, 2015, 16(1): 1437-1480
[11]	Altman E. Constrained Markov Decision Processes: Stochastic Modeling. New York: Routledge, 1999.
[12]	Kamran D, Ren Y, Lauer M. High-level decisions from a safe maneuver catalog with reinforcement learning for safe and cooperative automated merging. In: Proceedings of the IEEE International Intelligent Transportation Systems Conference (ITSC). Indiana, USA: IEEE, 2021. 804−811
[13]	Trumpp R, Bayerlein H, Gesbert D. Modeling interactions of autonomous vehicles and pedestrians with deep multi-agent reinforcement learning for collision avoidance. In: Proceedings of the IEEE Intelligent Vehicles Symposium (IV). Aachen, Germany: IEEE, 2022. 331−336
[14]	Yang T Y, Zhang T N, Luu L, Ha S, Tan J, Yu W H. Safe reinforcement learning for legged locomotion. arXiv preprint arXiv: 2203.02638, 2022.
[15]	Zhao H J, Li Q Z, Zeng X, Liu Z M. Safe reinforcement learning algorithm and its application in intelligent control for CPS. Journal of Software, 2022, 33(7): 2538-2561 doi: 10.13328/j.cnki.jos.006588
[16]	Ji Y, Wang J H. Online optimal scheduling of a microgrid based on deep reinforcement learning. Control and Decision, 2022, 37(7): 1675-1684 doi: 10.13195/j.kzyjc.2021.0835
[17]	Zhang L R, Zhang Q, Shen L, Yuan B, Wang X Q. SafeRL-Kit: Evaluating efficient reinforcement learning methods for safe autonomous driving. arXiv preprint arXiv: 2206.08528, 2022.
[18]	Thananjeyan B, Balakrishna A, Nair S, Luo M, Srinivasan K, Hwang M, et al. Recovery RL: Safe reinforcement learning with learned recovery zones. IEEE Robotics and Automation Letters, 2021, 6(3): 4915-4922 doi: 10.1109/LRA.2021.3070252
[19]	Levine S, Kumar A, Tucker G, Fu J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv: 2005.01643, 2020.
[20]	Prudencio R F, Máximo M R O A, Colombini E L. A survey on offline reinforcement learning: Taxonomy, review, and open problems. arXiv preprint arXiv: 2203.01387, 2022.
[21]	Ji G L, Yan J Y, Du J X, Yan W Q, Chen J B, Lu Y K, et al. Towards safe control of continuum manipulator using shielded multiagent reinforcement learning. IEEE Robotics and Automation Letters, 2021, 6(4): 7461-7468 doi: 10.1109/LRA.2021.3097660
[22]	Zhan X Y, Xu H R, Zhang Y, Zhu X Y, Yin H L, Zheng Y. DeepThermal: Combustion optimization for thermal power generating units using offline reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. California, USA: AAAI Press, 2022. 4680−4688
[23]	Zhang Y X, Gao B Z, Guo L L, Guo H Y, Chen H. Adaptive decision-making for automated vehicles under roundabout scenarios using optimization embedded reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(12): 5526-5538 doi: 10.1109/TNNLS.2020.3042981
[24]	Savage T, Zhang D D, Mowbray M, Chanona E A D R. Model-free safe reinforcement learning for chemical processes using Gaussian processes. IFAC-PapersOnLine, 2021, 54(3): 504-509 doi: 10.1016/j.ifacol.2021.08.292
[25]	Mowbray M, Petsagkourakis P, Chanona E A, Zhang D D. Safe chance constrained reinforcement learning for batch process control. Computers & Chemical Engineering, 2022, 157: Article No. 107630
[26]	Vu T L, Mukherjee S, Huang R K, Huang Q H. Barrier function-based safe reinforcement learning for emergency control of power systems. In: Proceedings of the 60th IEEE Conference on Decision and Control (CDC). Texas, USA: IEEE, 2021. 3652−3657
[27]	García J, Shafie D. Teaching a humanoid robot to walk faster through safe reinforcement learning. Engineering Applications of Artificial Intelligence, 2020, 88: Article No. 103360 doi: 10.1016/j.engappai.2019.103360
[28]	Du Y, Wu D. Deep reinforcement learning from demonstrations to assist service restoration in islanded microgrids. IEEE Transactions on Sustainable Energy, 2022, 13(2): 1062-1072 doi: 10.1109/TSTE.2022.3148236
[29]	Pore A, Corsi D, Marchesini E, Dall'Alba D, Casals A, Farinelli A, et al. Safe reinforcement learning using formal verification for tissue retraction in autonomous robotic-assisted surgery. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Prague, Czech Republic: IEEE, 2021. 4025−4031
[30]	Dai S S, Liu Q. Action constrained deep reinforcement learning based safe automatic driving method. Computer Science, 2021, 48(9): 235-243 doi: 10.11896/jsjkx.201000084
[31]	Zhu X, Kang S C, Chen J Y. A contact-safe reinforcement learning framework for contact-rich robot manipulation. In: Proceedings of the International Conference on Intelligent Robots and Systems (IROS). Kyoto, Japan: IEEE, 2022. 2476−2482
[32]	Pan E, Petsagkourakis P, Mowbray M, Zhang D D, Chanona E A D R. Constrained model-free reinforcement learning for process optimization. Computers & Chemical Engineering, 2021, 154: Article No. 107462
[33]	Tabas D, Zhang B S. Computationally efficient safe reinforcement learning for power systems. In: Proceedings of the American Control Conference. Georgia, USA: IEEE, 2022. 3303−3310
[34]	Misra S, Deb P K, Koppala N, Mukherjee A, Mao S W. S-Nav: Safety-aware IoT navigation tool for avoiding COVID-19 hotspots. IEEE Internet of Things Journal, 2021, 8(8): 6975-6982 doi: 10.1109/JIOT.2020.3037641
[35]	Corsi D, Yerushalmi R, Amir G, Farinelli A, Harel D, Katz G. Constrained reinforcement learning for robotics via scenario-based programming. arXiv preprint arXiv: 2206.09603, 2022.
[36]	Zhang K, Wang Y H, Du J Z, Chu B, Celi L A, Kindle R, et al. Identifying decision points for safe and interpretable reinforcement learning in hypotension treatment. arXiv preprint arXiv: 2101.03309, 2021.
[37]	Zhao T Z, Luo J L, Sushkov O, Pevceviciute R, Heess N, Scholz J, et al. Offline meta-reinforcement learning for industrial insertion. In: Proceedings of the International Conference on Robotics and Automation (ICRA). Philadelphia, PA, USA: IEEE, 2022. 6386−6393
[38]	Sui Y N, Gotovos A, Burdick J W, Krause A. Safe exploration for optimization with Gaussian processes. In: Proceedings of the International Conference on Machine Learning. Lille, France: PMLR, 2015. 997−1005
[39]	Turchetta M, Berkenkamp F, Krause A. Safe exploration in finite Markov decision processes with Gaussian processes. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: Curran Associates Inc., 2016. 4312−4320
[40]	Wachi A, Kajino H, Munawar A. Safe exploration in Markov decision processes with time-variant safety using spatio-temporal Gaussian process. arXiv preprint arXiv: 1809.04232, 2018.
[41]	Alshiekh M, Bloem R, Ehlers R, Könighofer B, Niekum S, Topcu U. Safe reinforcement learning via shielding. In: Proceedings of the AAAI Conference on Artificial Intelligence. Louisiana, USA: AAAI Press, 2018. 2669−2678
[42]	Zhang W B, Bastani O, Kumar V. MAMPS: Safe multi-agent reinforcement learning via model predictive shielding. arXiv preprint arXiv: 1910.12639, 2019.
[43]	Jansen N, Könighofer B, Junges S, Serban A C, Bloem R. Safe reinforcement learning via probabilistic shields. arXiv preprint arXiv: 1807.06096, 2018.
[44]	Li S, Bastani O. Robust model predictive shielding for safe reinforcement learning with stochastic dynamics. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). Paris, France: IEEE, 2020. 7166−7172
[45]	Bastani O. Safe reinforcement learning with nonlinear dynamics via model predictive shielding. In: Proceedings of the American Control Conference. Los Angeles, USA: IEEE, 2021. 3488−3494
[46]	Perkins T J, Barto A G. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 2003, 3: 803-832
[47]	Berkenkamp F, Turchetta M, Schoellig A, Krause A. Safe model-based reinforcement learning with stability guarantees. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. California, USA: Curran Associates Inc., 2017. 908−919
[48]	Chow Y, Nachum O, Faust A, Ghavamzadeh M, Duéñez-Guzmán E. Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv: 1901.10031, 2019.
[49]	Jeddi A B, Dehghani N L, Shafieezadeh A. Lyapunov-based uncertainty-aware safe reinforcement learning. arXiv preprint arXiv: 2107.13944, 2021.
[50]	Cheng R, Orosz G, Murray R M, Burdick J W. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In: Proceedings of the AAAI Conference on Artificial Intelligence. Hawaii, USA: AAAI Press, 2019. 3387−3395
[51]	Yang Y L, Vamvoudakis K G, Modares H, Yin Y X, Wunsch D C. Safe intermittent reinforcement learning with static and dynamic event generators. IEEE Transactions on Neural Networks and Learning Systems, 2020, 31(12): 5441-5455 doi: 10.1109/TNNLS.2020.2967871
[52]	Marvi Z, Kiumarsi B. Safe reinforcement learning: A control barrier function optimization approach. International Journal of Robust and Nonlinear Control, 2021, 31(6): 1923-1940 doi: 10.1002/rnc.5132
[53]	Emam Y, Notomista G, Glotfelter P, Kira Z, Egerstedt M. Safe model-based reinforcement learning using robust control barrier functions. arXiv preprint arXiv: 2110.05415, 2021.
[54]	Bura A, HasanzadeZonuzy A, Kalathil D, Shakkottai S, Chamberland J F. Safe exploration for constrained reinforcement learning with provable guarantees. arXiv preprint arXiv: 2112.00885, 2021.
[55]	Thomas G, Luo Y P, Ma T Y. Safe reinforcement learning by imagining the near future. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. Montreal, Canada: Curran Associates Inc., 2021. 13859−13869
[56]	Ma Y J, Shen A, Bastani O, Jayaraman D. Conservative and adaptive penalty for model-based safe reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI Press, 2022. 5404−5412
[57]	Saunders W, Sastry G, Stuhlmüller A, Evans O. Trial without error: Towards safe reinforcement learning via human intervention. In: Proceedings of the International Conference on Autonomous Agents and MultiAgent Systems. Stockholm, Sweden: IFAAMAS, 2018. 2067−2069
[58]	Prakash B, Khatwani M, Waytowich N, Mohsenin T. Improving safety in reinforcement learning using model-based architectures and human intervention. In: Proceedings of the International Flairs Conference. Florida, USA: AAAI Press, 2019. 50−55
[59]	Sun H, Xu Z P, Fang M, Peng Z H, Guo J D, Dai B, et al. Safe exploration by solving early terminated MDP. arXiv preprint arXiv: 2107.04200, 2021.
[60]	Prakash B, Waytowich N R, Ganesan A, Oates T, Mohsenin T. Guiding safe reinforcement learning policies using structured language constraints. In: Proceedings of the SafeAI Workshop of AAAI Conference on Artificial Intelligence. New York, USA: AAAI Press, 2020. 153−161
[61]	Yang T Y, Hu M, Chow Y, Ramadge P J, Narasimhan K. Safe reinforcement learning with natural language constraints. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. Montreal, Canada: Curran Associates Inc., 2021. 13794−13808
[62]	Turchetta M, Kolobov A, Shah S, Krause A, Agarwal A. Safe reinforcement learning via curriculum induction. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc., 2020. 12151−12162
[63]	Peng Z H, Li Q Y, Liu C X, Zhou B L. Safe driving via expert guided policy optimization. In: Proceedings of the 5th Conference on Robot Learning. London, UK: PMLR, 2022. 1554−1563
[64]	Li Q Y, Peng Z H, Zhou B L. Efficient learning of safe driving policy via human-AI copilot optimization. arXiv preprint arXiv: 2202.10341, 2022.
[65]	Dalal G, Dvijotham K, Vecerik M, Hester T, Paduraru C, Tassa Y. Safe exploration in continuous action spaces. arXiv preprint arXiv: 1801.08757, 2018.
[66]	Zhu F, Wu W, Fu Y C, Liu Q. A dual deep network based secure deep reinforcement learning method. Chinese Journal of Computers, 2019, 42(8): 1812-1826 doi: 10.11897/SP.J.1016.2019.01812
[67]	Zheng L Y, Shi Y Y, Ratliff L J, Zhang B. Safe reinforcement learning of control-affine systems with vertex networks. In: Proceedings of the 3rd Conference on Learning for Dynamics and Control. Zurich, Switzerland: PMLR, 2021. 336−347
[68]	Marchesini E, Corsi D, Farinelli A. Exploring safer behaviors for deep reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI Press, 2022. 7701−7709
[69]	Mannucci T, van Kampen E J, de Visser C, Chu Q P. Safe exploration algorithms for reinforcement learning controllers. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(4): 1069-1081 doi: 10.1109/TNNLS.2017.2654539
[70]	Memarzadeh M, Pozzi M. Model-free reinforcement learning with model-based safe exploration: Optimizing adaptive recovery process of infrastructure systems. Structural Safety, 2019, 80: 46-55 doi: 10.1016/j.strusafe.2019.04.003
[71]	Wachi A, Wei Y Y, Sui Y N. Safe policy optimization with local generalized linear function approximations. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. Montreal, Canada: Curran Associates Inc., 2021. 20759−20771
[72]	Chow Y, Ghavamzadeh M, Janson L, Pavone M. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 2017, 18(1): 6070-6120
[73]	Ma H T, Guan Y, Li S E, Zhang X T, Zheng S F, Chen J Y. Feasible actor-critic: Constrained reinforcement learning for ensuring statewise safety. arXiv preprint arXiv: 2105.10682, 2021.
[74]	Roy J, Girgis R, Romoff J, Bacon P L, Pal C. Direct behavior specification via constrained reinforcement learning. In: Proceedings of the International Conference on Machine Learning. Maryland, USA: PMLR, 2022. 18828−18843
[75]	Sootla A, Cowen-Rivers A I, Jafferjee T, Wang Z Y, Mguni D H, Wang J, et al. Sauté RL: Almost surely safe reinforcement learning using state augmentation. In: Proceedings of the International Conference on Machine Learning. Maryland, USA: PMLR, 2022. 20423−20443
[76]	Tessler C, Mankowitz D J, Mannor S. Reward constrained policy optimization. arXiv preprint arXiv: 1805.11074, 2018.
[77]	Yu M, Yang Z R, Kolar M, Wang Z R. Convergent policy optimization for safe reinforcement learning. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc., 2019. 3127−3139
[78]	Bai Q B, Bedi A S, Agarwal M, Koppel A, Aggarwal V. Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI Press, 2022. 3682−3689
[79]	Achiam J, Held D, Tamar A, Abbeel P. Constrained policy optimization. In: Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: PMLR, 2017. 22−31
[80]	Schulman J, Levine S, Moritz P, Jordan M, Abbeel P. Trust region policy optimization. In: Proceedings of the International Conference on Machine Learning. Lille, France: PMLR, 2015. 1889−1897
[81]	Yang T Y, Rosca J, Narasimhan K, Ramadge P J. Projection-based constrained policy optimization. arXiv preprint arXiv: 2010.03152, 2020.
[82]	Zhang Y M, Vuong Q, Ross K W. First order constrained optimization in policy space. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc., 2020. 15338−15349
[83]	Zhang L R, Shen L, Yang L, Chen S X, Yuan B, Wang X Q, et al. Penalized proximal policy optimization for safe reinforcement learning. arXiv preprint arXiv: 2205.11814, 2022.
[84]	Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. Proximal policy optimization algorithms. arXiv preprint arXiv: 1707.06347, 2017.
[85]	Xu T Y, Liang Y B, Lan G H. CRPO: A new approach for safe reinforcement learning with convergence guarantee. In: Proceedings of the International Conference on Machine Learning. Vienna, Austria: PMLR, 2021. 11480−11491
[86]	Liu Z X, Cen Z P, Isenbaev V, Liu W, Wu Z S, Li B, et al. Constrained variational policy optimization for safe reinforcement learning. In: Proceedings of the International Conference on Machine Learning. Maryland, USA: PMLR, 2022. 13644−13668
[87]	Fujimoto S, Meger D, Precup D. Off-policy deep reinforcement learning without exploration. In: Proceedings of the International Conference on Machine Learning. California, USA: PMLR, 2019. 2052−2062
[88]	Kumar A, Fu J, Soh M, Tucker G, Levine S. Stabilizing off-policy Q-learning via bootstrapping error reduction. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc., 2019. 11784−11794
[89]	Zhou W X, Bajracharya S, Held D. PLAS: Latent action space for offline reinforcement learning. In: Proceedings of the Conference on Robot Learning. Cambridge, USA: PMLR, 2020. 1719−1735
[90]	Chen X, Ghadirzadeh A, Yu T H, Gao Y, Wang J H, Li W Z, et al. Latent-variable advantage-weighted policy optimization for offline RL. arXiv preprint arXiv: 2203.08949, 2022.
[91]	Kumar A, Zhou A, Tucker G, Levine S. Conservative Q-learning for offline reinforcement learning. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc., 2020. 1179−1191
[92]	Xu H R, Zhan X Y, Zhu X Y. Constraints penalized Q-learning for safe offline reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. California, USA: AAAI Press, 2022. 8753−8760
[93]	Kostrikov I, Nair A, Levine S. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv: 2110.06169, 2021.
[94]	Zhang R Y, Dai B, Li L H, Schuurmans D. GenDICE: Generalized offline estimation of stationary values. arXiv preprint arXiv: 2002.09072, 2020.
[95]	Zhan W H, Huang B H, Huang A, Jiang N, Lee J D. Offline reinforcement learning with realizability and single-policy concentrability. In: Proceedings of the Conference on Learning Theory. London, UK: PMLR, 2022. 2730−2775
[96]	Siegel N Y, Springenberg J T, Berkenkamp F, Abdolmaleki A, Neunert M, Lampe T, et al. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. arXiv preprint arXiv: 2002.08396, 2020.
[97]	Wang Z Y, Novikov A, Zolna K, Springenberg J T, Reed S, Shahriari B, et al. Critic regularized regression. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc., 2020. 7768−7778
[98]	Emmons S, Eysenbach B, Kostrikov I, Levine S. RvS: What is essential for offline RL via supervised learning. arXiv preprint arXiv: 2112.10751, 2021.
[99]	Uchendu I, Xiao T, Lu Y, Zhu B H, Yan M Y, Simon J, et al. Jump-start reinforcement learning. arXiv preprint arXiv: 2204.02372, 2022.
[100]	Ray A, Achiam J, Amodei D. Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv: 1910.01708, 2019.
[101]	Hawkins D. Constrained Optimization and Lagrange Multiplier Methods. Boston: Academic Press, 1982.
[102]	Yuan Z C, Hall A W, Zhou S Q, Brunke L, Greeff M, Panerati J, et al. Safe-control-gym: A unified benchmark suite for safe learning-based control and reinforcement learning. arXiv preprint arXiv: 2109.06325, 2021.
[103]	Buchli J, Farshidian F, Winkler A, Sandy T, Giftthaler M. Optimal and learning control for autonomous robots. arXiv preprint arXiv: 1708.09342, 2017.
[104]	Rawlings J B, Mayne D Q, Diehl M M. Model Predictive Control: Theory, Computation, and Design. Madison, Wisconsin: Nob Hill Publishing, 2017.
[105]	Haarnoja T, Zhou A, Abbeel P, Levine S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: Proceedings of the International Conference on Machine Learning. Stockholm, Sweden: PMLR, 2018. 1861−1870
[106]	Hewing L, Kabzan J, Zeilinger M N. Cautious model predictive control using Gaussian process regression. IEEE Transactions on Control Systems Technology, 2020, 28(6): 2736-2743 doi: 10.1109/TCST.2019.2949757
[107]	Pinto L, Davidson J, Sukthankar R, Gupta A. Robust adversarial reinforcement learning. In: Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: PMLR, 2017. 2817−2826
[108]	Vinitsky E, Du Y Q, Parvate K, Jang K, Abbeel P, Bayen A. Robust reinforcement learning using adversarial populations. arXiv preprint arXiv: 2008.01825, 2020.
[109]	Wabersich K P, Zeilinger M N. Linear model predictive safety certification for learning-based control. In: Proceedings of the IEEE Conference on Decision and Control (CDC). Florida, USA: IEEE, 2018. 7130−7135
[110]	Ames A D, Coogan S, Egerstedt M, Notomista G, Sreenath K, Tabuada P. Control barrier functions: Theory and applications. In: Proceedings of the 18th European Control Conference (ECC). Naples, Italy: IEEE, 2019. 3420−3431
[111]	Yang L, Ji L M, Dai J T, Zhang L R, Zhou B B, Li P F, et al. Constrained update projection approach to safe policy optimization. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc., 2022. 9111−9124
[112]	Li Q Y, Peng Z H, Feng L, Zhang Q H, Xue Z H, Zhou B L. MetaDrive: Composing diverse driving scenarios for generalizable reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3461-3475
[113]	Ha S, Xu P, Tan Z Y, Levine S, Tan J. Learning to walk in the real world with minimal human effort. arXiv preprint arXiv: 2002.08550, 2020.
[114]	Fu J, Kumar A, Nachum O, Tucker G, Levine S. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv: 2004.07219, 2020.
[115]	Wu Y F, Tucker G, Nachum O. Behavior regularized offline reinforcement learning. arXiv preprint arXiv: 1911.11361, 2019.
[116]	Peng X B, Kumar A, Zhang G, Levine S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv: 1910.00177, 2019.
[117]	Nachum O, Dai B, Kostrikov I, Chow Y, Li L H, Schuurmans D. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv: 1912.02074, 2019.
[118]	Qin R J, Gao S Y, Zhang X Y, Xu Z, Huang S K, Li Z W, et al. NeoRL: A near real-world benchmark for offline reinforcement learning. arXiv preprint arXiv: 2102.00714, 2021.
[119]	Matsushima T, Furuta H, Matsuo Y, Nachum O, Gu S X. Deployment-efficient reinforcement learning via model-based offline optimization. arXiv preprint arXiv: 2006.03647, 2020.
[120]	Yu T H, Thomas G, Yu L T, Ermon S, Zou J, Levine S, et al. MOPO: Model-based offline policy optimization. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc., 2020. 14129−14142
[121]	Brunke L, Greeff M, Hall A W, Yuan Z C, Zhou S Q, Panerati J, et al. Safe learning in robotics: From learning-based control to safe reinforcement learning. arXiv preprint arXiv: 2108.06266, 2021.
[122]	Chen L L, Lu K, Rajeswaran A, Lee K, Grover A, Laskin M, et al. Decision transformer: Reinforcement learning via sequence modeling. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. Sydney, Australia: Curran Associates Inc., 2021. 15084−15097
[123]	Janner M, Li Q Y, Levine S. Offline reinforcement learning as one big sequence modeling problem. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. Sydney, Australia: Curran Associates Inc., 2021. 1273−1286