A Novel Reinforcement Learning Algorithm Based on Multilayer Memristive Spiking Neural Network With Applications
Abstract: Combining reinforcement learning algorithms with artificial neural networks (ANNs) markedly improves the learning ability and efficiency of agents. However, these algorithms consume substantial computing resources and are difficult to implement in hardware. Spiking neural networks (SNNs), which convey information through spike signals, are energy-efficient and biologically plausible, and they lend themselves to hardware acceleration of reinforcement learning and to stronger autonomous learning in embedded agents. Nevertheless, the learning and training of SNNs remain complex, and designing and implementing such networks is challenging. By introducing the memristor, an ideal device for realizing artificial synapses, this paper proposes a hardware-friendly reinforcement learning algorithm based on a multilayer memristive SNN. Specifically, spiking neurons for data-to-spike conversion are designed; the spike-timing-dependent plasticity (STDP) rule is improved so that the SNN combines organically with the reinforcement learning algorithm, and the corresponding memristive synapses are constructed; and a dynamically adjustable network structure is built to raise learning efficiency. Finally, taking CartPole-v0 (inverted pendulum) and MountainCar-v0 (mountain car) from OpenAI Gym as examples, simulations and comparative analysis verify the effectiveness of the scheme and its advantages over conventional reinforcement learning methods.
Keywords:
- Reinforcement learning
- Spiking neural networks
- Spike-timing-dependent plasticity rule
- Memristor
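The data-to-spike conversion stage maps continuous Gym observations onto spike trains for the input layer. The abstract does not spell out the coding scheme, so the following Python sketch uses simple latency (time-to-first-spike) coding as one plausible reading; the window length `T_WINDOW` and the clipping ranges are illustrative assumptions, not values from the paper.

```python
import numpy as np

T_WINDOW = 50.0  # encoding window in ms (assumed for illustration)

def encode_latency(x, x_min, x_max):
    """Map a continuous observation vector to one spike time per input neuron.

    Larger (normalized) input values fire earlier within the encoding window.
    """
    x_norm = (np.asarray(x, dtype=float) - x_min) / (x_max - x_min)
    x_norm = np.clip(x_norm, 0.0, 1.0)   # clamp out-of-range observations
    return T_WINDOW * (1.0 - x_norm)     # strong inputs spike early

# Example: a CartPole observation (position, velocity, angle, tip velocity);
# the velocity bounds below are illustrative clipping ranges, not Gym's own.
obs = np.array([0.02, -0.30, 0.01, 0.45])
lo = np.array([-2.4, -3.0, -0.21, -3.0])
hi = np.array([ 2.4,  3.0,  0.21,  3.0])
spike_times = encode_latency(obs, lo, hi)  # spike times in ms
```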
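The core learning step couples an STDP trace with the reinforcement signal on a conductance-bounded memristive synapse. Below is a minimal sketch of such a reward-modulated, weight-dependent update; the kernel shape, the constants, and the soft-saturation rule are assumptions chosen to mimic generic memristive devices, not the authors' exact model.

```python
import numpy as np

A_PLUS, A_MINUS = 0.05, 0.05      # STDP amplitudes (assumed)
TAU_PLUS, TAU_MINUS = 20.0, 20.0  # STDP time constants in ms (assumed)
G_MIN, G_MAX = 1e-4, 1e-2         # memristor conductance bounds in S (assumed)

def stdp_trace(dt):
    """Classic pairwise STDP kernel; dt = t_post - t_pre in ms."""
    if dt >= 0:   # pre fires before post -> potentiation
        return A_PLUS * np.exp(-dt / TAU_PLUS)
    return -A_MINUS * np.exp(dt / TAU_MINUS)  # post before pre -> depression

def update_conductance(g, dt, reward):
    """Reward-modulated, weight-dependent update of memristor conductance g.

    The raw STDP trace is scaled by the (signed) reward, and the step size
    shrinks near the conductance bounds, mimicking device saturation.
    """
    w = (g - G_MIN) / (G_MAX - G_MIN)    # normalized weight in [0, 1]
    dw = reward * stdp_trace(dt)
    dw *= (1.0 - w) if dw > 0 else w     # weight-dependent scaling
    return float(np.clip(g + dw * (G_MAX - G_MIN), G_MIN, G_MAX))

# Example: pre spike at t = 10 ms, post spike at t = 14 ms, positive reward
g = update_conductance(g=5e-3, dt=4.0, reward=1.0)
```

A positive reward strengthens causally paired synapses (pre before post) and weakens anti-causal ones; a negative reward reverses both directions.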
Table 1 Comparison of TD variance for different numbers of hidden-layer neurons

| Task | CartPole-v0 | MountainCar-v0 |
|---|---|---|
| Hidden = 1 | 27.14 | 5.17 |
| Hidden = 2 | 24.52 | 5.03 |
| Hidden = 4 | 21.2 | 4.96 |
| Hidden = 6 | 19.45 | 4.87 |
| Hidden = 10 | 17.26 | 4.79 |
| Hidden = 12 | 14.04 | 4.65 |

Table 2 Comparison results (B)

| Algorithm (task) | Average iteration steps | Average score | Average CPU utilization (%) | Running time (s) |
|---|---|---|---|---|
| MSRL (CartPole-v0) | 98.93 | 1.28 | 12.0 | 3528.38 |
| DQN (CartPole-v0) | 61.79 | 1.22 | 23.5 | 1119.52 |
| Q-learning (CartPole-v0) | 11.83 | 1.14 | 0.3 | 105.60 |
| MSRL (MountainCar-v0) | 183.87 | 1.23 | 11.8 | 1358.14 |
| DQN (MountainCar-v0) | 204.32 | 1.12 | 22.9 | 359.21 |
| Q-learning (MountainCar-v0) | 250.26 | 0.98 | 0.2 | 32.68 |
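The per-task metrics in Table 2 (average iteration steps, average score) are gathered from repeated rollouts in the two Gym environments. A minimal evaluation loop under the classic (pre-0.26) Gym API might look as follows; `RandomAgent` is a hypothetical stand-in for the trained MSRL network, with `learn` as a placeholder for the SNN weight update.

```python
import gym

class RandomAgent:
    """Hypothetical agent interface; the trained network would replace this."""
    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, observation):
        return self.action_space.sample()

    def learn(self, observation, reward, done):
        pass  # placeholder for the reward-modulated STDP update

def evaluate(env_name, episodes=100):
    """Average episode length over a number of episodes (classic Gym API)."""
    env = gym.make(env_name)
    agent = RandomAgent(env.action_space)
    total_steps = 0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            obs, reward, done, _ = env.step(agent.act(obs))
            agent.learn(obs, reward, done)
            total_steps += 1
    env.close()
    return total_steps / episodes

for task in ('CartPole-v0', 'MountainCar-v0'):
    print(task, evaluate(task))
```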