Abstract: One way of implementing deep Q-learning is the deep Q-network (DQN). Experience replay trains a deep Q-network by reusing transitions stored in a replay memory, but building this memory requires the agent to interact with the environment a large number of times, which increases both cost and risk. An effective way to reduce the number of interactions is to use the stored samples more efficiently. The cumulative return of the episode from which a transition was collected affects the training of the DQN: compared with transitions from episodes with small cumulative returns, transitions from episodes with large cumulative returns accelerate the convergence of the network and lead to better policies. This paper proposes a twice active sampling method for deep Q-learning. First, episodes in the replay memory are sampled with priorities constructed from the distribution of their cumulative returns. Then, transitions within the selected episodes are sampled with priorities constructed from the distribution of their temporal-difference errors (TD-errors). Finally, the transitions obtained by this two-stage sampling are used to train the deep Q-network. By selecting samples according to both the episode's cumulative return and the transition's TD-error, the method accelerates the convergence of the deep Q-network and improves the quality of the learned policy. The method is evaluated on the Atari platform, and the experimental results show that training the deep Q-network with samples obtained by twice active sampling achieves good performance.

1) Recommended by Associate Editor WEI Qing-Lai.
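To make the two-stage sampling procedure concrete, the following is a minimal Python sketch of the idea described above. All names (EpisodeReplayBuffer, sample_batch, the exponent alpha, the small constants added to priorities) are illustrative assumptions rather than the authors' implementation; the sketch only shows how episodes could be drawn with probability increasing in cumulative return, and how transitions within the chosen episodes could then be drawn with probability increasing in absolute TD-error.

import random

class EpisodeReplayBuffer:
    """Illustrative two-stage (episode, then transition) sampler."""

    def __init__(self, alpha=0.6):
        self.episodes = []   # each entry: {"return": float, "transitions": [...]}
        self.alpha = alpha   # controls how strongly priorities bias the sampling

    def add_episode(self, transitions, cumulative_return):
        # transitions: list of dicts with keys s, a, r, s_next, td_error
        self.episodes.append({"return": cumulative_return,
                              "transitions": transitions})

    def _pick(self, items, priorities, k):
        # Sample k items with probability proportional to priority**alpha.
        weights = [max(p, 1e-6) ** self.alpha for p in priorities]
        return random.choices(items, weights=weights, k=k)

    def sample_batch(self, n_episodes, batch_size):
        # Stage 1: sample episodes, biased toward large cumulative return.
        returns = [ep["return"] for ep in self.episodes]
        min_ret = min(returns)
        episode_prios = [r - min_ret + 1e-6 for r in returns]  # shift to positive
        chosen = self._pick(self.episodes, episode_prios, n_episodes)

        # Stage 2: pool the transitions of the chosen episodes and sample,
        # biased toward large absolute TD-error.
        pool = [t for ep in chosen for t in ep["transitions"]]
        td_prios = [abs(t["td_error"]) for t in pool]
        return self._pick(pool, td_prios, batch_size)

The sampled transitions would then be fed to the usual DQN loss and gradient update. Proportional prioritization with an exponent, as written here, mirrors the common prioritized experience replay formulation and is only one possible way of turning cumulative returns and TD-errors into sampling probabilities.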
Table 1 Average running time and average convergence steps for different numbers of sampled episodes (cartpole)

Number of sampled episodes    Average running time (s)    Average convergence steps
8                             455.7025                    79251.0
16                            491.0820                    71950.0
24                            498.1949                    69188.8
32                            527.1340                    68543.8
40                            541.2012                    63389.2
48                            567.1340                    64344.3
Table 2 Comparison of different sampling orders (cartpole)

Sampling order                        Non-converged runs (out of 10)    Average convergence steps
Cumulative return, then TD-error      1                                 61024.0
TD-error, then cumulative return      5                                 63010.0
Table 3 Summary of normalized scores on all games
Table 4 Scores on 12 Atari games with no-ops evaluation

Game             Random agent    Human expert    DDQN         PER          Our method
Alien            227.80          6875.40         2907.30      3310.80      3692.30
Asteroids        719.10          13156.70        930.60       1699.30      1927.30
Bank Heist       14.20           734.40          728.30       1126.80      1248.20
Breakout         1.70            31.80           403.00       381.50       533.01
Centipede        2090.90         11963.20        4139.40      5175.40      5691.30
Crazy Climber    10780.50        35410.50        101874.00    183137.00    185513.70
MsPacman         307.30          15693.40        3210.00      4751.20      5313.90
Phoenix          761.00          7242.60         12252.50     32808.30     39427.40
Pong             -20.70          9.30            21.00        20.70        21.00
Private Eye      24.90           69571.30        129.70       200.00       265.00
Riverraid        1338.50         13513.30        12015.30     20494.00     14231.70
Robotank         2.20            11.90           62.70        58.60        66.70
Table 5 Normalized scores on 12 Atari games

Game             DDQN        PER         Our method
Alien            40.31%      46.38%      52.12%
Asteroids        1.70%       7.80%       9.71%
Bank Heist       99.15%      154.48%     171.34%
Breakout         1333.22%    1261.79%    1765.12%
Centipede        20.75%      31.24%      36.47%
Crazy Climber    369.85%     699.78%     709.43%
MsPacman         18.87%      28.88%      32.54%
Phoenix          177.30%     494.40%     596.59%
Pong             139.00%     138.00%     139.00%
Private Eye      0.15%       0.25%       0.35%
Riverraid        87.70%      157.34%     105.90%
Robotank         623.71%     581.44%     664.95%
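The relation between Table 4 and Table 5 appears consistent with the human-normalized score commonly used in the Atari/DQN literature, 100 * (agent - random) / (human - random). That formula is not stated explicitly in this excerpt, so the small Python check below is only a hedged verification against the Alien row of the two tables.

# Hedged sanity check: Table 5 looks like Table 4 normalized by
# 100 * (agent - random) / (human - random), verified here on the Alien row.
def normalized_score(agent, random_score, human):
    return 100.0 * (agent - random_score) / (human - random_score)

print(round(normalized_score(3692.30, 227.80, 6875.40), 2))  # our method -> 52.12
print(round(normalized_score(3310.80, 227.80, 6875.40), 2))  # PER        -> 46.38
print(round(normalized_score(2907.30, 227.80, 6875.40), 2))  # DDQN       -> 40.31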