Abstract: In theory, momentum algorithms can speed up the training of restricted Boltzmann machine (RBM) networks. A simulation study of existing momentum algorithms shows that their acceleration effect in RBM training is poor and that they gradually lose their accelerating ability in the later stages of training. To address this problem, this paper first analyzes existing momentum algorithms theoretically, based on the Gibbs sampling convergence theorem, and proves that their acceleration comes at the expense of enlarging the network weights. A further investigation of the network weights then shows that they contain a large amount of information about the direction of the true gradient, and that this directional information can be used to train the network. On this basis, a weight momentum algorithm based on the network weights is proposed. Finally, simulation results show that the proposed algorithm accelerates training more effectively and retains good acceleration performance in the later stages of training, and therefore compensates well for the weaknesses of existing momentum algorithms.
Key words: deep learning; restricted Boltzmann machine (RBM); momentum algorithm; weight momentum
1) Recommended by Associate Editor WEI Qing-Lai
Table 1 Values of network parameters

| Network parameter | Initial value |
| --- | --- |
| $a$ | $\mathrm{zeros}(1, 784)$ |
| $b$ | $\mathrm{zeros}(1, 500)$ |
| $w$ | $0.1\times \mathrm{randn}(784, 500)$ |
| $\eta$ | $0.1$ |
| $\mu$ | $0.9$ |

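Table 2 Training parameters

| Algorithm | $\mu$ | $\lambda$ | $\alpha$ |
| --- | --- | --- | --- |
| CD | 0.9 | | |
| CM | 0.9 | | |
| NM | 0.9 | | |
| CMD | 0.9 | 0.00001 | |
| NMD | 0.9 | 0.00001 | |
| CDW | 0.9 | | 0.0001 |
| CMW | 0.9 | | 0.0001 |
| NMW | 0.9 | | 0.0001 |
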
Table 3 Notation

| Code | Difference term |
| --- | --- |
| A | CM-CD |
| B | NM-CD |
| C | CMW-CD |
| D | NMW-CD |
| E | CDW-CD |
| F | CMW-CD |
| G | NMW-CD |

Table 4 Values of network parameters

| Network parameter | Initial value |
| --- | --- |
| $a$ | $\mathrm{zeros}(1, 1024)$ |
| $b$ | $\mathrm{zeros}(1, 800)$ |
| $w$ | $0.1\times \mathrm{randn}(1024, 800)$ |
| $\eta$ | $0.01$ |
| $\mu$ | $0.9$ |

Table 5 Values of network parameters

| Network parameter | Initial value |
| --- | --- |
| $a$ | $\mathrm{zeros}(1, 3072)$ |
| $b$ | $\mathrm{zeros}(1, 2000)$ |
| $w$ | $0.1\times \mathrm{randn}(3072, 2000)$ |
| $\eta$ | $0.01$ |
| $\mu$ | $0.9$ |

Table 6 Values of network parameters

| Network parameter | Initial value |
| --- | --- |
| $a$ | $\mathrm{zeros}(1, 3072)$ |
| $b$ | $\mathrm{zeros}(1, 2000)$ |
| $w$ | $0.1\times \mathrm{randn}(3072, 2000)$ |
| $\eta$ | $0.01$ |
| $\mu$ | $0.9$ |

Table 7 Values of network parameters

| Network parameter | Initial value |
| --- | --- |
| $a$ | $\mathrm{zeros}(1, 4096)$ |
| $b$ | $\mathrm{zeros}(1, 3000)$ |
| $w$ | $0.1\times \mathrm{randn}(4096, 3000)$ |
| $\eta$ | $0.01$ |
| $\mu$ | $0.9$ |
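
The initial values listed in Table 1 ($a$ and $b$ set to zero, $w$ drawn as $0.1\times$ a standard normal sample, $\eta = 0.1$, $\mu = 0.9$) can be illustrated with a minimal pure-Python sketch. It assumes that CD stands for one-step contrastive divergence and uses the standard classical momentum update (velocity $\leftarrow \mu\cdot$ velocity $+\ \eta\cdot$ gradient estimate). It is not the weight momentum algorithm proposed in the paper, whose update rule is not stated in this section. All function names are hypothetical, only the weights are updated, and the demonstration uses a tiny network rather than the $784\times 500$ one.

```python
import math
import random


def init_rbm(n_visible=784, n_hidden=500, scale=0.1, seed=0):
    """Initialise a, b, w as in Table 1 (sizes are arguments so they can be shrunk)."""
    rng = random.Random(seed)
    a = [0.0] * n_visible                      # visible biases, zeros(1, n_visible)
    b = [0.0] * n_hidden                       # hidden biases, zeros(1, n_hidden)
    w = [[scale * rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
         for _ in range(n_visible)]            # scale * randn(n_visible, n_hidden)
    return a, b, w, rng


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def hidden_probs(v, b, w):
    """P(h_j = 1 | v) for every hidden unit."""
    return [sigmoid(b[j] + sum(v[i] * w[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, a, w):
    """P(v_i = 1 | h) for every visible unit."""
    return [sigmoid(a[i] + sum(w[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]


def sample(probs, rng):
    """Draw binary states from independent Bernoulli probabilities."""
    return [1.0 if rng.random() < p else 0.0 for p in probs]


def cd1_momentum_step(v0, a, b, w, vel, rng, eta=0.1, mu=0.9):
    """One CD-1 weight update with a classical momentum term (assumed form).

    Biases are left unchanged to keep the sketch short.
    """
    h0 = hidden_probs(v0, b, w)                             # positive phase
    v1 = sample(visible_probs(sample(h0, rng), a, w), rng)  # one Gibbs step
    h1 = hidden_probs(v1, b, w)                             # negative phase
    for i in range(len(a)):
        for j in range(len(b)):
            grad = v0[i] * h0[j] - v1[i] * h1[j]            # CD-1 gradient estimate
            vel[i][j] = mu * vel[i][j] + eta * grad         # momentum accumulation
            w[i][j] += vel[i][j]
    return v1


# Tiny demonstration: 6 visible and 4 hidden units, one binary training vector.
a, b, w, rng = init_rbm(n_visible=6, n_hidden=4)
vel = [[0.0] * 4 for _ in range(6)]
v0 = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
for _ in range(5):
    cd1_momentum_step(v0, a, b, w, vel, rng)
```

Keeping the velocity in a separate array makes the role of $\mu$ explicit: with $\mu = 0$ the step reduces to plain CD-1, while $\mu = 0.9$ carries most of the previous update into the next one.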