A Novel Reinforcement Learning Algorithm Based on Multilayer Memristive Spiking Neural Network With Applications

ZHANG Yao-Zhong^1, HU Xiao-Fang^{2,3}, ZHOU Yue^{3,4}, DUAN Shu-Kai^{2,3}

doi: 10.16383/j.aas.c180685

1. School of Computer and Information Science, Southwest University, Chongqing 400715
2. School of Artificial Intelligence, Southwest University, Chongqing 400715
3. Chongqing Key Laboratory of Brain-inspired Computing and Intelligent Control, Chongqing 400715
4. College of Electronic and Information Engineering, Southwest University, Chongqing 400715

Funds:

China Postdoctoral Science Foundation 2018T110937

Fundamental Research Funds for the Central Universities XDJK2019C034

National Student's Platform for Innovation and Entrepreneurship Training Program 201810635017

Chongqing Postdoctoral Science Foundation Xm2017039

National Natural Science Foundation of China 61601376

National Natural Science Foundation of China 61672436

Fundamental Science and Advanced Technology Research Foundation of Chongqing cstc2016jcyjA0547

More Information

Author Bio:
ZHANG Yao-Zhong  Undergraduate at the School of Computer and Information Science, Southwest University. His research interests cover reinforcement learning and the theory and applications of spiking neural networks. E-mail: zhangyaozhong9@126.com

ZHOU Yue  Research assistant at the College of Electronic and Information Engineering, Southwest University. He received his master's degree from the School of Management and Engineering, Nanjing University in 2012. His research interests cover machine learning, deep learning, information security, and memristor devices and systems. E-mail: zhouyuenju@163.com

DUAN Shu-Kai  Professor at the School of Artificial Intelligence, Southwest University. He received his Ph.D. degree from the College of Computer Science, Chongqing University in 2006. His research interests cover nano-information devices and systems, neuromorphic computing systems, nonlinear circuits and systems, and machine learning. E-mail: duansk@swu.edu.cn

Corresponding author:
HU Xiao-Fang  Associate professor at the School of Artificial Intelligence, Southwest University. She received her Ph.D. degree from the Department of Mechanical and Biomedical Engineering, City University of Hong Kong, China in 2015. Her research interests cover memristive devices and system applications; neural network algorithms, models, and hardware implementation; reinforcement learning; and image processing. Corresponding author of this paper. E-mail: huxf@swu.edu.cn

Metrics
- Article views: 2830
- HTML full-text views: 580
- PDF downloads: 296
- Citations: 0
Publication history
- Received: 2018-10-22
- Accepted: 2018-12-26
- Published: 2019-08-20

Abstract

Abstract: Combining reinforcement learning algorithms with artificial neural networks (ANNs) significantly enhances the learning ability and efficiency of agents. However, these algorithms consume large amounts of computing resources and are difficult to implement in hardware. Spiking neural networks (SNNs) convey information through spikes, are energy-efficient and biologically plausible, and lend themselves to hardware acceleration of reinforcement learning and to embedded agents capable of autonomous learning. Nevertheless, the learning and training of SNNs remain complex, and designing and implementing such networks is challenging. By introducing the memristor, an ideal device for realizing artificial synapses, this paper proposes a hardware-friendly reinforcement learning algorithm based on a multilayer memristive SNN. Specifically, spiking neurons for data-to-spike conversion are designed; the spike-timing-dependent plasticity (STDP) rule is modified so that the SNN can be combined with reinforcement learning, and the corresponding memristive synapses are designed; and a dynamically adjustable network structure is constructed to improve learning efficiency. Finally, simulations and comparative analysis on CartPole-v0 (inverted pendulum) and MountainCar-v0 in the OpenAI Gym environment verify the effectiveness of the proposed scheme and its advantages over conventional reinforcement learning methods.
- Reinforcement learning
- spiking neural network (SNN)
- spike-timing-dependent plasticity (STDP)
- memristor
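
As a rough illustration of how an STDP rule can be coupled to a reinforcement signal, the sketch below scales a classic pair-based STDP window by a one-step Q-learning temporal-difference (TD) error. This is a generic textbook-style construction, not the modified rule proposed in the paper: the function names, the exponential window, the time constants, the learning rate, and the weight bounds are all assumptions made for illustration.

```python
import math


def stdp_window(dt, a_plus=1.0, a_minus=1.0, tau_plus=20.0, tau_minus=20.0):
    """Pair-based STDP window (assumed form); dt = t_post - t_pre in ms."""
    if dt >= 0:
        # Presynaptic spike precedes (or coincides with) the postsynaptic
        # spike: potentiation that decays with the spike-time gap.
        return a_plus * math.exp(-dt / tau_plus)
    # Postsynaptic spike precedes the presynaptic spike: depression.
    return -a_minus * math.exp(dt / tau_minus)


def td_error(reward, q_next_max, q_current, gamma=0.99):
    """One-step Q-learning TD error: r + gamma * max_a' Q(s', a') - Q(s, a)."""
    return reward + gamma * q_next_max - q_current


def modulated_update(weight, dt, delta, lr=0.01, w_min=0.0, w_max=1.0):
    """Scale the STDP change by the TD error and clip to [w_min, w_max].

    The bounds stand in for the finite conductance range of a memristive
    synapse (an assumption; no device model is simulated here).
    """
    new_weight = weight + lr * delta * stdp_window(dt)
    return min(w_max, max(w_min, new_weight))


if __name__ == "__main__":
    delta = td_error(reward=1.0, q_next_max=2.0, q_current=0.5, gamma=0.9)
    print(modulated_update(weight=0.5, dt=5.0, delta=delta, lr=0.05))
```

With this construction, a positive TD error reinforces causal (pre-before-post) spike pairs, while a negative TD error reverses the sign of the change, so the same local timing rule can both strengthen and weaken a synapse depending on the outcome of the action.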
Notes:

1) Associate editor responsible for this paper: ZHANG Min-Ling

Fig. 1 The process of Q-learning

Fig. 2 LIF model

Fig. 3 Schematic of the HP memristor model

Fig. 4 The structure of the SNN

Fig. 5 The response of spiking neurons

Fig. 6 The training process of the memristive spiking neural network

Fig. 7 Schematic of CartPole-v0

Fig. 8 Schematic of MountainCar-v0

Fig. 9 Comparison of sample state distributions before and after MSRL training

Fig. 10 Comparison results (A)

Table 1 Comparison of TD variance for different numbers of hidden-layer neurons

Hidden neurons / Task	CartPole-v0	MountainCar-v0
$\rm Hidden = 1$	27.14	5.17
$\rm Hidden = 2$	24.52	5.03
$\rm Hidden = 4$	21.2	4.96
$\rm Hidden = 6$	19.45	4.87
$\rm Hidden = 10$	17.26	4.79
$\rm Hidden = 12$	14.04	4.65

Table 2 Comparison results (B)

Method (task) / Metric	Average iteration steps	Average score	Average CPU utilization (%)	Running time (s)
MSRL (CartPole-v0)	98.93	1.28	12.0	3528.38
DQN (CartPole-v0)	61.79	1.22	23.5	1119.52
Q-learning (CartPole-v0)	11.83	1.14	0.3	105.60
MSRL (MountainCar-v0)	183.87	1.23	11.8	1358.14
DQN (MountainCar-v0)	204.32	1.12	22.9	359.21
Q-learning (MountainCar-v0)	250.26	0.98	0.2	32.68
