Hu Zi-Jian, Gao Xiao-Guang, Wan Kai-Fang, Zhang Le-Tian, Wang Qiang-Long, Neretin Evgeny

Hu Zi-Jian, Gao Xiao-Guang, Wan Kai-Fang, Zhang Le-Tian, Wang Qiang-Long, Neretin Evgeny. Research on experience replay of off-policy deep reinforcement learning: A review. Acta Automatica Sinica, 2023, 49(11): 2237−2256. doi: 10.16383/j.aas.c220648
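Funds: Supported by the National Natural Science Foundation of China (62003267, 61573285), the Fundamental Research Funds for the Central Universities (G2022KY0602), the Technology on Electromagnetic Space Operations and Applications Laboratory (2022ZX0090), the Key Core Technology Research Plan of Xi'an (21RGZN0016), and the Key Research and Development Program of Shaanxi Province (2023-GHZD-33)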

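    Hu Zi-Jian: Ph.D. candidate at the School of Electronics and Information, Northwestern Polytechnical University. He received his bachelor's degree in detection, guidance and control technology from Northwestern Polytechnical University in 2018. His research interests cover reinforcement learning theory and applications. E-mail: huzijian@mail.nwpu.edu.cn

    Gao Xiao-Guang: Professor at the School of Electronics and Information, Northwestern Polytechnical University. She received her Ph.D. degree in system engineering from Northwestern Polytechnical University in 1989. Her research interests cover machine learning theory, Bayesian network theory, and multi-agent control applications. E-mail: cxg2012@nwpu.edu.cn

    Wan Kai-Fang: Associate researcher at the School of Electronics and Information, Northwestern Polytechnical University. He received his Ph.D. degree in system engineering from Northwestern Polytechnical University in 2016. His research interests cover multi-agent theory, approximate dynamic programming, and reinforcement learning. Corresponding author of this paper. E-mail: wankaifang@nwpu.edu.cn

    Zhang Le-Tian: Master's student at the School of Foreign Languages, Xidian University. Her research interests cover scientific translation, translation theory, and machine translation. E-mail: 22091213382@stu.xidian.edu.cn

    Wang Qiang-Long: Ph.D. candidate at the School of Electronics and Information, Northwestern Polytechnical University. His research interests cover deep learning and reinforcement learning. E-mail: wql1995@mail.nwpu.edu.cn

    Neretin Evgeny: Professor at Moscow Aviation Institute. He received his Ph.D. degree in technical sciences from Moscow Aviation Institute in 2011. His research interests cover avionics and intelligent decision-making. E-mail: evgeny.neretin@gmail.com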

Research on Experience Replay of Off-policy Deep Reinforcement Learning: A Review


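  • Abstract: As a machine learning method that does not require training data to be collected in advance, reinforcement learning (RL) searches for the optimal policy through continuous interaction between the agent and the environment, and is an important approach to sequential decision-making problems. Combined with deep learning (DL), deep reinforcement learning (DRL) gains both strong perception and strong decision-making capabilities, and is widely applied in many fields to solve complex decision-making problems. Off-policy reinforcement learning stores and replays interaction experiences, which separates exploration from exploitation and makes it easier to find the global optimum. Using experience reasonably and efficiently is the key to improving the efficiency of off-policy reinforcement learning methods. This paper first introduces the basic theory of reinforcement learning; then briefly introduces on-policy and off-policy reinforcement learning algorithms; next presents two mainstream solutions to the experience replay (ER) problem, namely experience exploitation and experience augmentation; and finally summarizes related research and discusses future directions.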
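
The store-and-replay mechanism mentioned in the abstract can be illustrated with a minimal sketch. It assumes a fixed-capacity first-in-first-out buffer with uniform random sampling, which is only the simplest form of experience replay; the names `Transition` and `ReplayBuffer`, the integer states, and the capacity are hypothetical choices for illustration and are not taken from the paper.

```python
import random
from collections import deque
from typing import Deque, List, NamedTuple


class Transition(NamedTuple):
    """One interaction experience (hypothetical layout for illustration)."""
    state: int
    action: int
    reward: float
    next_state: int
    done: bool


class ReplayBuffer:
    """Fixed-capacity FIFO store with uniform random sampling."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self._storage: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        # When the buffer is full, deque(maxlen=...) drops the oldest entry.
        self._storage.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement from the stored experiences.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self) -> int:
        return len(self._storage)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(Transition(state=t, action=0, reward=1.0, next_state=t + 1, done=False))
batch = buffer.sample(batch_size=2)
```

Because sampling is decoupled from the order in which experiences were collected, the learner can reuse data produced by earlier behaviour policies, which is what allows exploration and exploitation to be separated.
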
  • Fig. 1  The process of reinforcement learning

    Fig. 2  The classification of reinforcement learning algorithms

    Fig. 3  The framework of DQN algorithm

    Fig. 4  The framework of DDPG algorithm

    Fig. 5  The experience replay process of off-policy RL

    Fig. 6  The classification of experience replay

    Fig. 7  The framework of QER algorithm

    Fig. 8  The sampling process of “sum-tree”

    Fig. 9  The data structure of “double sum-tree”

    Fig. 10  The framework of model experience augmentation algorithms

    Table 1  Comparison of advantages of on-policy and off-policy algorithms

    Algorithm advantage    On-policy RL    Off-policy RL
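    Table 2  Comparison of prioritized experience replay algorithms

    Algorithm                     Priority metric                            Sampling rounds
    PER[43], PSER[44], PPER[45]   TD error                                   Single round
    HVPER[46]                     Q-value, TD error                          Single round
    TASM[47]                      Cumulative sequence reward, TD error       Multiple rounds
    AER[48]                       Similarity                                 Multiple rounds
    REL[49]                       TD error, similarity                       Multiple rounds
    KLPER[50]                     Policy similarity of batched experiences   Single round
    DCRL[51]                      Experience difficulty, sampling count      Single round
    ACER[54]                      Experience difficulty                      Single round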
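
PER[43] in Table 2 prioritizes experiences by TD error, and Fig. 8 refers to a “sum-tree” sampling process. The following is a minimal sketch of proportional sampling with a sum-tree, assuming the priority form p = (|δ| + ε)^α commonly associated with PER. The class `SumTree`, the helper names, the power-of-two capacity restriction, and the values of α and ε are illustrative assumptions, not details taken from the paper.

```python
import random
from typing import List


class SumTree:
    """Binary tree whose leaves hold priorities and whose inner nodes hold sums."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1 or capacity & (capacity - 1):
            raise ValueError("this sketch assumes a power-of-two capacity")
        self.capacity = capacity
        # Node 1 is the root; leaves occupy positions capacity .. 2*capacity-1.
        self.nodes: List[float] = [0.0] * (2 * capacity)

    def update(self, index: int, priority: float) -> None:
        pos = index + self.capacity
        self.nodes[pos] = priority
        pos //= 2
        while pos >= 1:  # refresh the sums on the path back to the root
            self.nodes[pos] = self.nodes[2 * pos] + self.nodes[2 * pos + 1]
            pos //= 2

    def total(self) -> float:
        return self.nodes[1]

    def find(self, mass: float) -> int:
        """Return the leaf index whose cumulative-priority interval contains mass."""
        pos = 1
        while pos < self.capacity:
            left = 2 * pos
            if mass < self.nodes[left]:
                pos = left
            else:
                mass -= self.nodes[left]
                pos = left + 1
        return pos - self.capacity


def td_priority(td_error: float, alpha: float = 0.6, eps: float = 1e-2) -> float:
    # eps keeps zero-error experiences sampleable; alpha tempers the skew.
    return (abs(td_error) + eps) ** alpha


def sample_indices(tree: SumTree, batch_size: int, rng: random.Random) -> List[int]:
    """Stratified proportional sampling: one draw per equal-mass segment."""
    segment = tree.total() / batch_size
    return [tree.find((i + rng.random()) * segment) for i in range(batch_size)]


td_errors = [0.5, 0.1, 2.0, 0.0]  # made-up TD errors for four stored experiences
tree = SumTree(capacity=4)
for index, delta in enumerate(td_errors):
    tree.update(index, td_priority(delta))
batch = sample_indices(tree, batch_size=2, rng=random.Random(0))
```

Both updating a priority and drawing one sample walk a single root-to-leaf path, so each costs time logarithmic in the buffer capacity rather than linear.
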
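    Table 3  Comparison of classification experience replay algorithms

    Algorithm      Classification criterion                                  Replay buffer form                  Sampling strategy
    CER[59]        Whether it is the current experience                      Single buffer + temporary storage   Random sampling + current experience
    ACER[54]       Whether it is the latest experience                       Multiple buffers                    Prioritized sampling + latest experience
    ReFER[60]      Difference between experience policy and current policy   Single buffer                       Random sampling + experience filtering
    RC[61]         Reward                                                    Multiple buffers                    Static sampling
    TDC[61]        TD error                                                  Multiple buffers                    Static sampling
    EPS[49]        Scenario-based evaluation metric                          Multiple buffers + single buffer    Static sampling
    CADP[62]       TD error                                                  Multiple buffers                    Dynamic sampling
    DDN-SDRL[63]   Danger level of the state                                 Multiple buffers                    Static sampling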
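
Several algorithms in Table 3 (for example RC[61]) split experiences across multiple replay buffers (experience pools) and draw from them by static sampling. A minimal sketch follows, assuming that static sampling means a fixed share of every batch comes from each buffer and that the classification criterion is a simple reward threshold. The class `ClassifiedReplay`, the pool names, the threshold, and the 50/50 ratio are hypothetical and not taken from the paper.

```python
import random
from typing import Dict, List, Tuple

Experience = Tuple[int, int, float, int]  # (state, action, reward, next_state)


class ClassifiedReplay:
    """Two reward-classified pools sampled with a fixed (static) ratio."""

    def __init__(self, high_ratio: float = 0.5, threshold: float = 0.0,
                 seed: int = 0) -> None:
        self.high_ratio = high_ratio
        self.threshold = threshold
        self.pools: Dict[str, List[Experience]] = {"high_reward": [], "low_reward": []}
        self._rng = random.Random(seed)

    def add(self, exp: Experience) -> None:
        # Classify by reward (index 2 of the tuple) at insertion time.
        label = "high_reward" if exp[2] > self.threshold else "low_reward"
        self.pools[label].append(exp)

    def sample(self, batch_size: int) -> List[Experience]:
        n_high = round(batch_size * self.high_ratio)
        shares = {"high_reward": n_high, "low_reward": batch_size - n_high}
        batch: List[Experience] = []
        for name, count in shares.items():
            pool = self.pools[name]
            if pool and count > 0:
                # Draw with replacement so a small pool can still fill its share.
                batch.extend(self._rng.choices(pool, k=count))
        return batch


replay = ClassifiedReplay(high_ratio=0.5)
for t in range(10):
    reward = 1.0 if t % 5 == 0 else 0.0  # only steps 0 and 5 are rewarded
    replay.add((t, 0, reward, t + 1))
batch = replay.sample(batch_size=4)
```

With a fixed ratio, rare high-reward experiences appear in every batch even though they make up a small fraction of the stored data. A dynamic scheme would instead adjust `high_ratio` during training.
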
    Table 4  Optimization approaches of experience storage structure algorithms

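    Table 5  Comparison of expert demonstration experience algorithms

    LfOD[73]   Simulation platform   Expert network + actual training   Multiple buffers   Dynamic sampling + prioritized sampling   Autonomous intersection management
    IEP[74]    Human demonstration   Expert network + actual training   Single buffer      Random sampling                           Autonomous driving
    VD4[78]    Human demonstration   Pre-training + actual training     Multiple buffers   Prioritized sampling                      Autonomous underwater vehicle control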
