Abstract: One way of implementing deep Q-learning is the deep Q-network (DQN). Experience replay trains a deep Q-network by reusing transitions stored in a replay memory, but building this memory requires the agent to interact with the environment a large number of times, which increases both cost and risk. An effective way to reduce the number of interactions is to use the stored samples more efficiently. The cumulative return of the episode from which a transition was collected affects the training of the DQN: compared with transitions from episodes with small cumulative returns, transitions from episodes with large cumulative returns accelerate the convergence of the network and lead to better policies. This paper proposes a twice active sampling method for deep Q-learning. First, episodes in the replay memory are sampled with priorities constructed from the distribution of their cumulative returns. Then, transitions within the selected episodes are sampled with priorities constructed from the distribution of their temporal-difference errors (TD-errors). Finally, the transitions obtained by this two-stage sampling are used to train the deep Q-network. By selecting samples according to both the episode's cumulative return and the transition's TD-error, the method accelerates the convergence of the deep Q-network and improves the quality of the learned policy. The method is evaluated on the Atari platform, and the experimental results show that training the deep Q-network with samples obtained by twice active sampling achieves good performance.

1) Recommended by Associate Editor WEI Qing-Lai.
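To make the two-stage sampling procedure concrete, the following is a minimal Python sketch of the idea described above. All names (EpisodeReplayBuffer, sample_batch, the exponent alpha, the small constants added to priorities) are illustrative assumptions rather than the authors' implementation; the sketch only shows how episodes could be drawn with probability increasing in cumulative return, and how transitions within the chosen episodes could then be drawn with probability increasing in absolute TD-error.

import random

class EpisodeReplayBuffer:
    """Illustrative two-stage (episode, then transition) sampler."""

    def __init__(self, alpha=0.6):
        self.episodes = []   # each entry: {"return": float, "transitions": [...]}
        self.alpha = alpha   # controls how strongly priorities bias the sampling

    def add_episode(self, transitions, cumulative_return):
        # transitions: list of dicts with keys s, a, r, s_next, td_error
        self.episodes.append({"return": cumulative_return,
                              "transitions": transitions})

    def _pick(self, items, priorities, k):
        # Sample k items with probability proportional to priority**alpha.
        weights = [max(p, 1e-6) ** self.alpha for p in priorities]
        return random.choices(items, weights=weights, k=k)

    def sample_batch(self, n_episodes, batch_size):
        # Stage 1: sample episodes, biased toward large cumulative return.
        returns = [ep["return"] for ep in self.episodes]
        min_ret = min(returns)
        episode_prios = [r - min_ret + 1e-6 for r in returns]  # shift to positive
        chosen = self._pick(self.episodes, episode_prios, n_episodes)

        # Stage 2: pool the transitions of the chosen episodes and sample,
        # biased toward large absolute TD-error.
        pool = [t for ep in chosen for t in ep["transitions"]]
        td_prios = [abs(t["td_error"]) for t in pool]
        return self._pick(pool, td_prios, batch_size)

The sampled transitions would then be fed to the usual DQN loss and gradient update. Proportional prioritization with an exponent, as written here, mirrors the common prioritized experience replay formulation and is only one possible way of turning cumulative returns and TD-errors into sampling probabilities.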
Table 1 Average running time and average convergence steps for different numbers of sampled episodes (cartpole)

Number of sampled episodes    Average running time (s)    Average convergence steps
8                             455.7025                    79251.0
16                            491.0820                    71950.0
24                            498.1949                    69188.8
32                            527.1340                    68543.8
40                            541.2012                    63389.2
48                            567.1340                    64344.3
Table 2 Comparison of different sampling orders (cartpole)

Sampling order                        Non-converged runs (out of 10)    Average convergence steps
Cumulative return, then TD-error      1                                 61024.0
TD-error, then cumulative return      5                                 63010.0
Table 3 Summary of normalized scores on all games
Table 4 Scores on 12 Atari games with no-ops evaluation

Game             Random agent    Human expert    DDQN         PER          Our method
Alien            227.80          6875.40         2907.30      3310.80      3692.30
Asteroids        719.10          13156.70        930.60       1699.30      1927.30
Bank Heist       14.20           734.40          728.30       1126.80      1248.20
Breakout         1.70            31.80           403.00       381.50       533.01
Centipede        2090.90         11963.20        4139.40      5175.40      5691.30
Crazy Climber    10780.50        35410.50        101874.00    183137.00    185513.70
MsPacman         307.30          15693.40        3210.00      4751.20      5313.90
Phoenix          761.00          7242.60         12252.50     32808.30     39427.40
Pong             -20.70          9.30            21.00        20.70        21.00
Private Eye      24.90           69571.30        129.70       200.00       265.00
Riverraid        1338.50         13513.30        12015.30     20494.00     14231.70
Robotank         2.20            11.90           62.70        58.60        66.70
Table 5 Normalized scores on 12 Atari games

Game             DDQN        PER         Our method
Alien            40.31%      46.38%      52.12%
Asteroids        1.70%       7.80%       9.71%
Bank Heist       99.15%      154.48%     171.34%
Breakout         1333.22%    1261.79%    1765.12%
Centipede        20.75%      31.24%      36.47%
Crazy Climber    369.85%     699.78%     709.43%
MsPacman         18.87%      28.88%      32.54%
Phoenix          177.30%     494.40%     596.59%
Pong             139.00%     138.00%     139.00%
Private Eye      0.15%       0.25%       0.35%
Riverraid        87.70%      157.34%     105.90%
Robotank         623.71%     581.44%     664.95%
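The relation between Table 4 and Table 5 appears consistent with the human-normalized score commonly used in the Atari/DQN literature, 100 * (agent - random) / (human - random). That formula is not stated explicitly in this excerpt, so the small Python check below is only a hedged verification against the Alien row of the two tables.

# Hedged sanity check: Table 5 looks like Table 4 normalized by
# 100 * (agent - random) / (human - random), verified here on the Alien row.
def normalized_score(agent, random_score, human):
    return 100.0 * (agent - random_score) / (human - random_score)

print(round(normalized_score(3692.30, 227.80, 6875.40), 2))  # our method -> 52.12
print(round(normalized_score(3310.80, 227.80, 6875.40), 2))  # PER        -> 46.38
print(round(normalized_score(2907.30, 227.80, 6875.40), 2))  # DDQN       -> 40.31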