2.765

2022影响因子

(CJCR)

  • 中文核心
  • EI
  • 中国科技核心
  • Scopus
  • CSCD
  • 英国科学文摘

留言板

尊敬的读者、作者、审稿人, 关于本刊的投稿、审稿、编辑和出版的任何问题, 您可以本页添加留言。我们将尽快给您答复。谢谢您的支持!

姓名
邮箱
手机号码
标题
留言内容
验证码

基于多层忆阻脉冲神经网络的强化学习及应用

张耀中 胡小方 周跃 段书凯

张耀中, 胡小方, 周跃, 段书凯. 基于多层忆阻脉冲神经网络的强化学习及应用. 自动化学报, 2019, 45(8): 1536-1547. doi: 10.16383/j.aas.c180685
引用本文: 张耀中, 胡小方, 周跃, 段书凯. 基于多层忆阻脉冲神经网络的强化学习及应用. 自动化学报, 2019, 45(8): 1536-1547. doi: 10.16383/j.aas.c180685
ZHANG Yao-Zhong, HU Xiao-Fang, ZHOU Yue, DUAN Shu-Kai. A Novel Reinforcement Learning Algorithm Based on Multilayer Memristive Spiking Neural Network With Applications. ACTA AUTOMATICA SINICA, 2019, 45(8): 1536-1547. doi: 10.16383/j.aas.c180685
Citation: ZHANG Yao-Zhong, HU Xiao-Fang, ZHOU Yue, DUAN Shu-Kai. A Novel Reinforcement Learning Algorithm Based on Multilayer Memristive Spiking Neural Network With Applications. ACTA AUTOMATICA SINICA, 2019, 45(8): 1536-1547. doi: 10.16383/j.aas.c180685

基于多层忆阻脉冲神经网络的强化学习及应用

doi: 10.16383/j.aas.c180685
基金项目: 

中国博士后科学基金 2018T110937

中央高校基本科研业务费 XDJK2019C034

国家级大学生创新创业训练计划项目 201810635017

重庆市博士后科学基金 Xm2017039

国家自然科学基金 61601376

国家自然科学基金 61672436

重庆市基础与前沿技术研究专项 cstc2016jcyjA0547

详细信息
    作者简介:

    张耀中  西南大学计算机与信息科学学院本科生.主要研究方向为强化学习, 脉冲神经网络理论与应用.E-mail:zhangyaozhong9@126.com

    周跃  西南大学电子信息工程学院研究助理.2012年获得南京大学工程管理学院硕士学位.主要研究方向为机器学习, 深度学习, 信息安全, 忆阻器件与系统.E-mail:zhouyuenju@163.com

    段书凯  西南大学人工智能学院教授.2006年获得重庆大学计算机科学学院博士学位.主要研究方向为纳米信息器件与系统, 神经形态计算系统, 非线性电路与系统, 机器学习.E-mail:duansk@swu.edu.cn

    通讯作者:

    胡小方  西南大学人工智能学院副教授.2015年获得中国香港城市大学机械与生物医学工程系博士学位.主要研究方向为忆阻器件与系统应用, 神经网络算法, 模型与硬件实现, 强化学习, 图像处理.本文通信作者.E-mail:huxf@swu.edu.cn

A Novel Reinforcement Learning Algorithm Based on Multilayer Memristive Spiking Neural Network With Applications

Funds: 

Special Science Foundation of Chinese Postdoctoral Fellow 2018T110937

Fundamental Research Funds for the Central Universities XDJK2019C034

National Student0s Platform for Innovation and Entrepreneurship Training Program 201810635017

Special Foundation of Postdoctoral Fellow of Chongqing Xm2017039

National Natural Science Foundation of China 61601376

National Natural Science Foundation of China 61672436

Fundamental Science and Advanced Technology Research Foundation of Chongqing cstc2016jcyjA0547

More Information
    Author Bio:

     Undergraduate at the College of Computer and Information Science, Southwest University. His research interest covers reinforcement learning, theories and applications of spiking neural networks

     Research assistant at the College of Electronic and Information Engineering, Southwest University. He received his master degree from Nanjing University in 2012. His research interest covers machine learning, deep learning, information security, memristor devices and systems

     Professor at the College of Artiflcial Intelligence, Southwest University. He received his Ph. D. degree from Chongqing University in 2006. His research interest covers nano-information devices and systems, nonlinear circuits and systems, and machine learning

    Corresponding author: HU Xiao-Fang  Associate professor at the College of Artiflcial Intelligence, Southwest University. She received her Ph. D. degree from City University of Hong Kong, China in 2015. Her research interest covers memristive devices and system applications, neural network algorithm, model and hardware implementation, reinforcement learning, image processing. Corresponding author of this paper
  • 摘要: 人工神经网络(Artificial neural networks,ANNs)与强化学习算法的结合显著增强了智能体的学习能力和效率.然而,这些算法需要消耗大量的计算资源,且难以硬件实现.而脉冲神经网络(Spiking neural networks,SNNs)使用脉冲信号来传递信息,具有能量效率高、仿生特性强等特点,且有利于进一步实现强化学习的硬件加速,增强嵌入式智能体的自主学习能力.不过,目前脉冲神经网络的学习和训练过程较为复杂,网络设计和实现方面存在较大挑战.本文通过引入人工突触的理想实现元件——忆阻器,提出了一种硬件友好的基于多层忆阻脉冲神经网络的强化学习算法.特别地,设计了用于数据——脉冲转换的脉冲神经元;通过改进脉冲时间依赖可塑性(Spiking-timing dependent plasticity,STDP)规则,使脉冲神经网络与强化学习算法有机结合,并设计了对应的忆阻神经突触;构建了可动态调整的网络结构,以提高网络的学习效率;最后,以Open AI Gym中的CartPole-v0(倒立摆)和MountainCar-v0(小车爬坡)为例,通过实验仿真和对比分析,验证了方案的有效性和相对于传统强化学习方法的优势.
    1)  本文责任编委 张敏灵
  • 图  1  Q学习过程

    Fig.  1  The process of Q-learning

    图  2  LIF模型

    Fig.  2  LIF model

    图  3  HP忆阻器模型示意图

    Fig.  3  HP memristor

    图  4  脉冲神经网络结构

    Fig.  4  The structure of SNN

    图  5  脉冲神经元响应

    Fig.  5  The response of spiking neurons

    图  6  忆阻脉冲神经网络的训练过程

    Fig.  6  The training process of memristive spiking neural network

    图  7  CartPole-v0示意图

    Fig.  7  CartPole-v0

    图  8  MountainCar-v0示意图

    Fig.  8  MountainCar-v0

    图  9  MSRL训练前后样本状态分布对比

    Fig.  9  The comparison of sample states distribution before and after training of MSRL

    图  10  比较结果(A)

    Fig.  10  The results of comparison (A)

    表  1  不同隐含层神经元数量TD方差对比

    Table  1  The comparison of TD variance for difierent hidden neurons

    任务CartPole-v0MountainCar-v0
    $\rm Hidden = 1$27.145.17
    $\rm Hidden = 2$24.525.03
    $\rm Hidden = 4$21.24.96
    $\rm Hidden = 6$19.454.87
    $\rm Hidden = 10$17.264.79
    $\rm Hidden = 12$14.044.65
    下载: 导出CSV

    表  2  比较结果(B)

    Table  2  The results of comparison (B)

    评价指标平均迭代步数平均分数平均CPU利用率(%)运行时间(s)
    MSRL (CartPole-v0)98.931.2812.03 528.38
    DQN (CartPole-v0)61.791.2223.51 119.52
    Q-learning (CartPole-v0)11.831.140.3105.60
    MSRL (MountainCar-v0)183.871.2311.81 358.14
    DQN (MountainCar-v0)204.321.1222.9359.21
    Q-learning (MountainCar-v0)250.260.980.232.68
    下载: 导出CSV
  • [1] 高阳, 陈世富, 陆鑫.强化学习研究综述.自动化学报, 2004, 30(1):86-100 http://www.aas.net.cn/CN/abstract/abstract16352.shtml

    Gao Yang, Chen Shi-Fu, Lu Xin. Research on reinforcement learning:a review. Acta Automatica Sinica, 2004, 30(1):86 -100 http://www.aas.net.cn/CN/abstract/abstract16352.shtml
    [2] 唐昊, 万海峰, 韩江洪, 周雷.基于多Agent强化学习的多站点CSPS系统的协作Look-ahead控制.自动化学报, 2010, 36(2):289-296 http://www.aas.net.cn/CN/abstract/abstract13356.shtml

    Tang Hao, Wan Hai-Feng, Han Jiang-Hong, Zhou Lei. Coordinated look-ahead control of multiple CSPS system by multi-agent reinforcement learning. Acta Automatica Sinica, 2010, 36(2):289-296 http://www.aas.net.cn/CN/abstract/abstract13356.shtml
    [3] 秦蕊, 曾帅, 李娟娟, 袁勇.基于深度强化学习的平行企业资源计划.自动化学报, 2017, 43(9):1588-1596 http://www.aas.net.cn/CN/abstract/abstract19135.shtml

    Qin Rui, Zeng Shuai, Li Juan-Juan, Yuan Yong. Parallel enterprises resource planning based on deep reinforcement learning. Acta Automatica Sinica, 2017, 43(9):1588-1596 http://www.aas.net.cn/CN/abstract/abstract19135.shtml
    [4] Watkins C J C H, Dayan P. Q-learning. Machine Learning, 1992, 8(3-4):279-292 doi: 10.1007/BF00992698
    [5] Maass W. Networks of spiking neurons:the third generation of neural network models. Neural Networks, 1997, 10(9):1659-1671 doi: 10.1016/S0893-6080(97)00011-7
    [6] Florian R V. Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity. Neural Computation, 2007, 19(6):1468-1502 doi: 10.1162/neco.2007.19.6.1468
    [7] Cao Y Q, Chen Y, Khosla D. Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision, 2015, 113(1):54-66 doi: 10.1007/s11263-014-0788-3
    [8] Ghosh-Dastidar S, Adeli H. Spiking neural networks. International Journal of Neural Systems, 2009, 19(4):295-308 doi: 10.1142/S0129065709002002
    [9] Ponulak F. Analysis of the ReSuMe learning process for spiking neural networks. International Journal of Applied Mathematics and Computer Science, 2008, 18(2):117-127 doi: 10.2478/v10006-008-0011-1
    [10] Mostafa H. Supervised learning based on temporal coding in spiking neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(7):3227-3235 http://d.old.wanfangdata.com.cn/OAPaper/oai_arXiv.org_1109.2788
    [11] de Kamps M, van der Velde F. From artificial neural networks to spiking neurons and back again. Neural Networks, 2001, 14(6-7):941-953 doi: 10.1016/S0893-6080(01)00068-5
    [12] Zheng N, Mazumder P. Learning in memristor crossbar-based spiking neural networks through modulation of weight dependent spike-timing-dependent plasticity. IEEE Transactions on Nanotechnology, 2018, 17(3):520-532 http://ieeexplore.ieee.org/document/8328902/
    [13] Taherkhani A, Belatreche A, Li Y H, Maguire L P. A supervised learning algorithm for learning precise timing of multiple spikes in multilayer spiking neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(11):5394-5407 doi: 10.1109/TNNLS.2018.2797801
    [14] Chua L O. Memristor--the missing circuit element. IEEE Transactions on Circuit Theory, 1971, 18(5):507-519 doi: 10.1109/TCT.1971.1083337
    [15] Strukov D B, Snider G S, Stewart D R, Williams R S. The missing memristor found. Nature, 2008, 453(7191):80-83 doi: 10.1038/nature06932
    [16] Kvatinsky S, Friedman E G, Kolodny A, Weiser U C. TEAM:threshold adaptive memristor model. IEEE Transactions on Circuits and Systems I--Regular Papers, 2013, 60(1):211-221 doi: 10.1109/TCSI.2012.2215714
    [17] Hu X F, Feng G, Liu L, Duan S K. Composite characteristics of memristor series and parallel circuits. International Journal of Bifurcation and Chaos, 2015, 25(8):1530019 doi: 10.1142/S0218127415300190
    [18] Jo S H, Chang T, Ebong I, Bhadviya B B, Mazumder P, Lu W. Nanoscale memristor device as synapse in neuromorphic systems. Nano letters, 2010, 10(4):1297-1301 doi: 10.1021/nl904092h
    [19] Panwar N, Rajendran B, Ganguly U. Arbitrary spike time dependent plasticity (STDP) in memristor by analog waveform engineering. IEEE Electron Device Letters, 2017, 38(6):740-743 doi: 10.1109/LED.2017.2696023
    [20] Serrano-Gotarredona T, Masquelier T, Prodromakis T, Indiveri G, Linares-Barranco B. STDP and STDP variations with memristors for spiking neuromorphic learning systems. Frontiers in Neuroscience, 2013, 7(2), DOI:10.3389/fnins. 2013.00002
    [21] Goodman D F M, Brette R. The brian simulator. Frontiers in Neuroscience, 2009, 3(2):192-197 http://d.old.wanfangdata.com.cn/OAPaper/oai_pubmedcentral.nih.gov_2751620
    [22] Sutton R S, Barto A G. Reinforcement Learning:An Introduction. London:The MIT Press, 1999.185-187
    [23] Ferré P, Mamalet F, Thorpe S J. Unsupervised feature learning with winner-takes-all based STDP. Frontiers in Computational Neuroscience, 2018, 12(24), DOI:10.3389/fncom. 2018.00024
    [24] Gerstner W, Kistler W M. Spiking Neuron Models. New York:Cambridge University Press, 2002.
    [25] Hasselmo M E. Methods in neuronal modeling:from ions to networks. Science, 1998, 282(5391):1055-1055 doi: 10.1126/science.282.5391.1055
    [26] Hebb D O, Martinez J L, Glickman S E. The organization of behavior:a neuropsychological theory. Contemporary Psychology, 1994, 39(11):1018-1020 doi: 10.1037/034206
    [27] Markram H, Lubke J, Frotscher M, Sakmann B. Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 1997, 275(5297):213-215 doi: 10.1126/science.275.5297.213
    [28] Schaul T, Quan J, Antonoglou I, Silver D. Prioritized experience replay. In:Proceedings of the 4th International Conference on Learning Representations. Puerto Rico, San Juan:Cornell University Library, 2016.
    [29] Gollisch T, Meister M. Rapid neural coding in the retina with relative spike latencies. Science, 2008, 319(5866):1108 -1111 doi: 10.1126/science.1149639
    [30] Kostal L, Lansky P, Rospars J P. Neuronal coding and spiking randomness. European Journal of Neuroscience, 2007, 26(10):2693-2701 doi: 10.1111/j.1460-9568.2007.05880.x
    [31] Legenstein R, Pecevski D, Maass W. A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback. Plos Computational Biology, 2008, 4(10):e1000180 doi: 10.1371/journal.pcbi.1000180
    [32] Skorheim S, Lonjers P, Bazhenov M. A spiking network model of decision making employing rewarded STDP. Plos One, 2014, 9(3), DOI: 10.1371/journal.pone.0090821
    [33] Zheng N, Mazumder P. Hardware-friendly actor-critic reinforcement learning through modulation of spike-timing-dependent plasticity. IEEE Transactions on Computers, 2017, 66(2):299-311 doi: 10.1109/TC.2016.2595580
    [34] Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, et al. Playing atari with deep reinforcement learning. In:Proceedings of the 26th Conference and Workshop on Neural Information Processing Systems, Nevada, USA:Cornell University Library. 2013.
    [35] Li C Y, Lu J T, Wu C P, Duan S M, Poo M M. Bidirectional modification of presynaptic neuronal excitability accompanying spike timing-dependent synaptic plasticity, Neuron, 2004, 41(2):257-268 doi: 10.1016/S0896-6273(03)00847-X
    [36] Brette R, Rudolph M, Carnevale T, Hines M, Beeman D, Bower J M, et al. Simulation of networks of spiking neurons:a review of tools and strategies. Journal of Computational Neuroscience, 2007, 23(3):349-398 doi: 10.1007/s10827-007-0038-6
  • 加载中
图(10) / 表(2)
计量
  • 文章访问数:  2591
  • HTML全文浏览量:  539
  • PDF下载量:  281
  • 被引次数: 0
出版历程
  • 收稿日期:  2018-10-22
  • 录用日期:  2018-12-26
  • 刊出日期:  2019-08-20

目录

    /

    返回文章
    返回