Twice Sampling Method in Deep Q-network

ZHAO Ying-Nan, LIU Peng, ZHAO Wei, TANG Xiang-Long

Citation: ZHAO Ying-Nan, LIU Peng, ZHAO Wei, TANG Xiang-Long. Twice Sampling Method in Deep Q-network. ACTA AUTOMATICA SINICA, 2019, 45(10): 1870-1882. doi: 10.16383/j.aas.2018.c170635

doi: 10.16383/j.aas.2018.c170635

Funds:

    National Natural Science Foundation of China 61671175

    National Natural Science Foundation of China 61672190

More Information
    Author Bio:

    ZHAO Ying-Nan  Ph.D. candidate at the School of Computer Science and Technology, Harbin Institute of Technology. He received his master degree in computer science and technology from Harbin Institute of Technology in 2017. His research interest covers reinforcement learning and machine learning. E-mail: ynzhao_rl@163.com

    LIU Peng  Associate professor at the School of Computer Science and Technology, Harbin Institute of Technology. He received his Ph.D. degree in microelectronics and solid state electronics from Harbin Institute of Technology in 2007. His research interest covers image processing, video analysis, pattern recognition, and very large scale integrated (VLSI) circuit design. E-mail: pengliu@hit.edu.cn

    TANG Xiang-Long  Professor at the School of Computer Science and Technology, Harbin Institute of Technology. He received his Ph.D. degree in computer application technology from Harbin Institute of Technology in 1995. His research interest covers pattern recognition, image processing, and machine learning. E-mail: tangxl@hit.edu.cn

    Corresponding author: ZHAO Wei  Associate professor at the School of Computer Science and Technology, Harbin Institute of Technology. She won a First Prize of the Heilongjiang Province Science and Technology Progress Award. Her research interest covers pattern recognition, machine learning, and computer vision. Corresponding author of this paper. E-mail: zhaowei@hit.edu.cn

  • Abstract: Deep Q-networks (DQN) are one way to implement deep Q-learning. Experience replay trains the deep Q-network with samples drawn from a replay buffer, and building this buffer requires a large number of interactions between the agent and the environment, which raises both cost and risk. An effective way to reduce the number of agent-environment interactions is to use the collected samples more efficiently. The cumulative return of the episode that a sample belongs to affects the training of the deep Q-network: samples from episodes with large cumulative returns speed up the convergence of the deep Q-network and improve the quality of the learned policy more than samples from episodes with small cumulative returns. This paper proposes a twice active sampling method for deep Q-learning. First, episodes in the replay buffer are sampled with priorities constructed from the distribution of episode cumulative returns. Then, within the sampled episodes, transitions are sampled with priorities constructed from the distribution of their TD-errors (temporal-difference errors). The samples obtained by these two sampling stages are used to train the deep Q-network. By selecting samples according to both episode cumulative return and TD-error, the method accelerates the convergence of the deep Q-network and improves the quality of the policy. The method is validated on the Atari platform, and the experimental results show that training the deep Q-network with samples obtained by the twice active sampling performs well.
    1)  Recommended by Associate Editor WEI Qing-Lai
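To make the two-stage sampling described in the abstract concrete, here is a minimal sketch, not the authors' implementation: episodes are drawn with probabilities built from their cumulative returns, then transitions are drawn from the chosen episodes with probabilities built from their TD-errors. The class and parameter names (EpisodeReplayBuffer, alpha, beta, td_error_fn) are illustrative assumptions, and the proportional priority scheme used here stands in for the exact prioritization of the paper.

```python
# Minimal sketch of the two-stage (episode-level, then transition-level) sampling
# idea described in the abstract. This is NOT the authors' code; all names and
# hyper-parameters (alpha, beta, eps, n_episodes, batch_size) are illustrative.
import numpy as np


class EpisodeReplayBuffer:
    """Replay buffer that stores whole episodes and samples in two stages."""

    def __init__(self, alpha=0.6, beta=0.6, eps=1e-6):
        self.episodes = []   # each episode is a list of (s, a, r, s_next, done)
        self.returns = []    # cumulative (undiscounted) return of each episode
        self.alpha = alpha   # exponent for the episode (return-based) priority
        self.beta = beta     # exponent for the transition (TD-error) priority
        self.eps = eps       # keeps every priority strictly positive

    def add_episode(self, transitions):
        self.episodes.append(list(transitions))
        self.returns.append(sum(t[2] for t in transitions))

    def _probs(self, scores, exponent):
        # Shift scores so the smallest one still gets a tiny positive priority,
        # then sharpen with the exponent and normalize into a distribution.
        s = np.asarray(scores, dtype=np.float64)
        s = s - s.min() + self.eps
        p = s ** exponent
        return p / p.sum()

    def sample(self, td_error_fn, n_episodes=8, batch_size=32, rng=None):
        rng = rng or np.random.default_rng()
        # Stage 1: sample episodes, favouring large cumulative return.
        ep_idx = rng.choice(len(self.episodes),
                            size=min(n_episodes, len(self.episodes)),
                            replace=False,
                            p=self._probs(self.returns, self.alpha))
        pool = [t for i in ep_idx for t in self.episodes[i]]
        # Stage 2: sample transitions from the pooled episodes, favouring large |TD-error|.
        td = [abs(td_error_fn(t)) for t in pool]
        tr_idx = rng.choice(len(pool),
                            size=min(batch_size, len(pool)),
                            p=self._probs(td, self.beta))
        return [pool[i] for i in tr_idx]


# Toy usage with random rewards and a placeholder TD-error; in practice the
# TD-error would be |r + gamma * max_a' Q_target(s', a') - Q(s, a)| computed
# from the current online and target networks.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    buf = EpisodeReplayBuffer()
    for _ in range(20):
        length = int(rng.integers(5, 15))
        buf.add_episode([(None, 0, float(rng.random()), None, False)
                         for _ in range(length)])
    batch = buf.sample(td_error_fn=lambda t: float(rng.random()),
                       n_episodes=4, batch_size=16)
    print(len(batch))  # 16 transitions chosen by the two-stage sampling
```

In this sketch both stages use sampling proportional to a shifted score; rank-based priorities or importance-sampling corrections, as in prioritized experience replay, could be substituted without changing the two-stage structure.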
  • Fig. 1  The diagram of the cartpole environment

    Fig. 2  The cartpole comparison experiments in three cases

    Fig. 3  Screenshots of some Atari games

    Fig. 4  The $\epsilon$ value curve

    Fig. 5  The training curve of Centipede

    Fig. 6  The distribution of samples in the replay buffer

    Fig. 7  The training curves of Atari games

    Fig. 8  Screenshots of Riverraid

    Table 1  Average convergent step numbers and average running time for different numbers of sampled episodes (cartpole)

    Number of sampled episodes    Average running time (s)    Average convergent steps
    8                             455.7025                    79,251.0
    16                            491.0820                    71,950.0
    24                            498.1949                    69,188.8
    32                            527.1340                    68,543.8
    40                            541.2012                    63,389.2
    48                            567.1340                    64,344.3

    Table 2  Comparison experiments with different sampling orders (cartpole)

    Sampling order                      Non-converged runs (out of 10)    Average convergent steps
    Cumulative return, then TD-error    1                                 61,024.0
    TD-error, then cumulative return    5                                 63,010.0

    Table 3  Summary of normalized scores on all games

            DDQN [24]    PER [20]    Proposed method
    Mean    221.14%      300.16%     357.27%

    Table 4  Scores on 12 Atari games with no-ops evaluation

    Game             Random agent    Human expert    DDQN          PER           Proposed method
    Alien            227.80          6,875.40        2,907.30      3,310.80      3,692.30
    Asteroids        719.10          13,156.70       930.60        1,699.30      1,927.30
    Bank Heist       14.20           734.40          728.30        1,126.80      1,248.20
    Breakout         1.70            31.80           403.00        381.50        533.01
    Centipede        2,090.90        11,963.20       4,139.40      5,175.40      5,691.30
    Crazy Climber    10,780.50       35,410.50       101,874.00    183,137.00    185,513.70
    MsPacman         307.30          15,693.40       3,210.00      4,751.2       5,313.90
    Phoenix          761             7,242.6         12,252.5      32,808.3      39,427.4
    Pong             -20.70          9.30            21.00         20.70         21.00
    Private Eye      24.90           69,571.30       129.70        200.00        265.00
    Riverraid        1,338.50        13,513.30       12,015.30     20,494.00     14,231.70
    Robotank         2.20            11.90           62.70         58.60         66.70

    Table 5  Normalized scores on 12 Atari games

    Game             DDQN         PER          Proposed method
    Alien            40.31%       46.38%       52.12%
    Asteroids        1.70%        7.8%         9.71%
    Bank Heist       99.15%       154.48%      171.34%
    Breakout         1,333.22%    1,261.79%    1,765.12%
    Centipede        20.75%       31.24%       36.47%
    Crazy Climber    369.85%      699.78%      709.43%
    MsPacman         18.87%       28.88%       32.54%
    Phoenix          177.30%      494.40%      596.59%
    Pong             139.00%      138.00%      139.00%
    Private Eye      0.15%        0.25%        0.35%
    Riverraid        87.7%        157.34%      105.90%
    Robotank         623.71%      581.44%      664.95%
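This page does not restate how the normalized scores in Tables 3 and 5 are obtained from the raw scores in Table 4; they are consistent with the commonly used human-normalized score, 100% × (agent − random) / (human − random). The check below is an editorial sketch under that assumption, not code from the paper.

```python
# Editorial sanity check (assumption, not the paper's code): Table 5 appears to be
# the standard human-normalized score computed from the raw scores in Table 4.
def normalized_score(agent, random, human):
    return 100.0 * (agent - random) / (human - random)

# Alien, DDQN column: raw scores taken from Table 4.
print(f"{normalized_score(2907.30, 227.80, 6875.40):.2f}%")  # 40.31%, as in Table 5
```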
  • [1] Sutton R S, Barto A G. Reinforcement Learning: An Introduction (2nd edition, draft). MIT Press, 2017
    [2] Glimcher P W. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis. Proceedings of the National Academy of Sciences, 2011, 108(Supplement 3): 15647-15654 http://d.old.wanfangdata.com.cn/NSTLQK/NSTL_QKJJ0223400386/
    [3] Kober J, Bagnell J A, Peters J. Reinforcement learning in robotics: a survey. The International Journal of Robotics Research, 2013, 32(11): 1238-1274. doi: 10.1177/0278364913495721
    [4] Gao Yang, Chen Shi-Fu, Lu Xin. Research on reinforcement learning technology: a review. Acta Automatica Sinica, 2004, 30(1): 86-100 (in Chinese) http://www.aas.net.cn/CN/abstract/abstract16352.shtml
    [5] Watkins C J C H, Dayan P. Q-learning. Machine Learning, 1992, 8(3-4): 279-292. doi: 10.1007/BF00992698
    [6] Sutton R S. Learning to predict by the methods of temporal differences. Machine Learning, 1988, 3(1): 9-44 http://d.old.wanfangdata.com.cn/OAPaper/oai_arXiv.org_1110.2416
    [7] Sutton R S, McAllester D A, Singh S P, et al. Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems. Denver, United States: MIT Press, 2000. 1057-1063
    [8] Konda V R, Tsitsiklis J N. Actor-critic algorithms. In: Advances in Neural Information Processing Systems. Denver, United States: MIT Press, 2000. 1008-1014
    [9] Tesauro G. TD-Gammon: a self-teaching backgammon program. Applications of Neural Networks, 1995. 267-285 http://d.old.wanfangdata.com.cn/NSTLQK/NSTL_QKJJ029367329/
    [10] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436-444. doi: 10.1038/nature14539
    [11] Goodfellow I, Bengio Y, Courville A. Deep Learning. MIT Press, 2016
    [12] Guo Xiao-Xiao, Li Cheng, Mei Qiao-Zhu. Deep learning applied to games. Acta Automatica Sinica, 2016, 42(5): 676-684 (in Chinese) http://www.aas.net.cn/CN/abstract/abstract18857.shtml
    [13] Browne C B, Powley E, Whitehouse D, et al. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 2012, 4(1): 1-43. doi: 10.1109/TCIAIG.2012.2186810
    [14] Silver D, Huang A, Maddison C J, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016, 529(7587): 484-489. doi: 10.1038/nature16961
    [15] Tian Yuan-Dong. A simple analysis of AlphaGo. Acta Automatica Sinica, 2016, 42(5): 671-675 (in Chinese) http://www.aas.net.cn/CN/abstract/abstract18856.shtml
    [16] Chen Xing-Guo, Yu Yang. Reinforcement learning and its application to the game of Go. Acta Automatica Sinica, 2016, 42(5): 685-695 (in Chinese) http://www.aas.net.cn/CN/abstract/abstract18858.shtml
    [17] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529-533. doi: 10.1038/nature14236
    [18] Mnih V, Kavukcuoglu K, Silver D, et al. Playing Atari with deep reinforcement learning. arXiv preprint arXiv: 1312.5602, 2013
    [19] Mirowski P, Pascanu R, Viola F, et al. Learning to navigate in complex environments. arXiv preprint arXiv: 1611.03673, 2016 https://arxiv.org/abs/1611.03673
    [20] Mikolov T, Karafiát M, Burget L, et al. Recurrent neural network based language model. Interspeech, 2010, 2: 3 http://d.old.wanfangdata.com.cn/Periodical/mssbyrgzn201504002
    [21] He D, Xia Y, Qin T, et al. Dual learning for machine translation. In: Advances in Neural Information Processing Systems. Barcelona, Spain: MIT Press, 2016. 820-828 https://arxiv.org/abs/1611.00179
    [22] Silver D, Lever G, Heess N, et al. Deterministic policy gradient algorithms. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14). Beijing, China: ACM, 2014. 387-395
    [23] Lillicrap T P, Hunt J J, Pritzel A, et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv: 1509.02971, 2015
    [24] Li Y. Deep reinforcement learning: an overview. arXiv preprint arXiv: 1701.07274, 2017
    [25] Baird L. Residual algorithms: reinforcement learning with function approximation. In: Proceedings of the 12th International Conference on Machine Learning. Tahoe City, United States: ACM, 1995. 30-37 https://www.sciencedirect.com/science/article/pii/B978155860377650013X
    [26] Taylor M E, Stone P. Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research, 2009, 10(Jul): 1633-1685 http://www.doc88.com/p-9022325630063.html
    [27] Yin H, Pan S J. Knowledge transfer for deep reinforcement learning with hierarchical experience replay. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence. Menlo Park, United States: AAAI, 2017. 1640-1646 https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14478
    [28] Glatt R, Costa A H R. Policy reuse in deep reinforcement learning. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence. Menlo Park, United States: AAAI, 2017. 4929-4930
    [29] Sutton R S. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 1991, 2(4): 160-163. doi: 10.1145/122344.122377
    [30] Deisenroth M, Rasmussen C E. PILCO: a model-based and data-efficient approach to policy search. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). Bellevue, United States: ACM, 2011. 465-472 https://www.mendeley.com/catalogue/pilco-modelbased-dataefficient-approach-policy-search/
    [31] Schaul T, Quan J, Antonoglou I, et al. Prioritized experience replay. arXiv preprint arXiv: 1511.05952, 2015
    [32] Zhai J, Liu Q, Zhang Z, et al. Deep Q-learning with prioritized sampling. In: Proceedings of the International Conference on Neural Information Processing. Kyoto, Japan: Springer, 2016. 13-22. doi: 10.1007/978-3-319-46687-3_2
    [33] Lin L J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 1992, 8(3-4): 293-321. doi: 10.1007/BF00992699
    [34] Morton J. Deep Reinforcement Learning [Online], available: http://web.stanford.edu/class/aa228/drl.pdf, April 18, 2018
    [35] Van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double Q-learning. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence. Menlo Park, United States: AAAI, 2016. 2094-2100
    [36] Wang Z, Schaul T, Hessel M, et al. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv: 1511.06581, 2015
    [37] Dolan R J, Dayan P. Goals and habits in the brain. Neuron, 2013, 80(2): 312-325. doi: 10.1016/j.neuron.2013.09.007
    [38] Zhao D, Wang H, Shao K, et al. Deep reinforcement learning with experience replay based on SARSA. In: Proceedings of the 2016 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 2016. 1-6 https://ieeexplore.ieee.org/document/7849837
    [39] Wang Z, Bapst V, Heess N, et al. Sample efficient actor-critic with experience replay. arXiv preprint arXiv: 1611.01224, 2016
    [40] Bellemare M G, Naddaf Y, Veness J, et al. The Arcade Learning Environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research, 2013, 47: 253-279. doi: 10.1613/jair.3912
Publication History
  • Received:  2017-11-13
  • Accepted:  2018-04-16
  • Published:  2019-10-20
