Towards Automatic Smart-contract Codes Classification by Means of Word Embedding Model and Transaction Information
Abstract: As an innovative extension of blockchain technology, smart contracts allow users to implement personalized code logic on the blockchain, making the technology simpler and more useful. However, with the rapid growth in the amount of smart contract code, managing and organizing it is becoming much more challenging. An automatic code classifier based on machine learning methods can identify the category of a piece of code from its textual information, saving substantial human effort. In this paper we investigate the smart contracts of the Ethereum platform and, since word embedding models can capture the semantic information of code, propose a novel word-embedding-based smart contract classifier. Moreover, because each smart contract is associated with a series of transactions, we further exploit the transaction information to better understand the intrinsic logic of a contract. To the best of our knowledge, this is the first study of automatic classification of smart contract codes. Extensive experiments verify the effectiveness of the proposed system.
Key words:
- smart contract
- codes
- transaction information
- word embedding
- neural network
- long short-term memory
1) Recommended by Associate Editor YUAN Yong
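The pipeline described in the abstract (embed the tokens of a contract's source code, aggregate them into a fixed-size vector, then classify) can be illustrated with a minimal sketch. This is a toy stand-in, not the paper's system: the embedding table is random rather than a trained word2vec model, mean pooling replaces the LSTM, and all token names are invented for illustration.

```python
import random

# Hypothetical token stream from a Solidity contract; the identifiers are
# illustrative and not drawn from the paper's dataset.
contract_tokens = ["contract", "Lottery", "function", "buyTicket", "payable",
                   "require", "msg", "value", "transfer"]

EMBED_DIM = 8
random.seed(42)

# Toy embedding table standing in for trained word2vec vectors:
# one fixed-length real vector per vocabulary token.
vocab = sorted(set(contract_tokens))
embeddings = {tok: [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
              for tok in vocab}

def contract_vector(tokens):
    """Mean-pool token embeddings into one fixed-size contract vector.

    In the paper's setting an LSTM would consume the embedding sequence
    instead; mean pooling is used here only to keep the sketch short.
    """
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

vec = contract_vector(contract_tokens)
```

The resulting vector would then be fed to a classifier (e.g. the neural network, naive Bayes, or SVM compared in Tables 1 to 3), optionally concatenated with features derived from the contract's transaction history.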
Table 1 Neural network classification performance (w/ tx = with transaction information; no tx = without)

| Category | Precision (w/ tx) | Recall (w/ tx) | Accuracy (w/ tx) | F1 (w/ tx) | Precision (no tx) | Recall (no tx) | Accuracy (no tx) | F1 (no tx) |
|---|---|---|---|---|---|---|---|---|
| Financial | 0.943 | 0.945 | 0.942 | 0.943 | 0.872 | 0.868 | 0.882 | 0.869 |
| Game | 0.924 | 0.897 | 0.924 | 0.910 | 0.895 | 0.874 | 0.886 | 0.884 |
| Lottery | 0.882 | 0.891 | 0.906 | 0.886 | 0.835 | 0.852 | 0.875 | 0.843 |
| Ethereum tool | 0.914 | 0.921 | 0.929 | 0.917 | 0.854 | 0.871 | 0.882 | 0.862 |
| Information management | 0.862 | 0.842 | 0.883 | 0.852 | 0.805 | 0.813 | 0.829 | 0.809 |
| Currency | 0.914 | 0.882 | 0.917 | 0.898 | 0.821 | 0.809 | 0.834 | 0.814 |
| Entertainment | 0.873 | 0.889 | 0.893 | 0.881 | 0.783 | 0.763 | 0.792 | 0.773 |
| IoT | 0.861 | 0.845 | 0.882 | 0.853 | 0.796 | 0.771 | 0.809 | 0.783 |
| Others | 0.832 | 0.814 | 0.845 | 0.823 | 0.753 | 0.757 | 0.791 | 0.754 |

Table 2 Naive Bayes classification performance (w/ tx = with transaction information; no tx = without)

| Category | Precision (w/ tx) | Recall (w/ tx) | Accuracy (w/ tx) | F1 (w/ tx) | Precision (no tx) | Recall (no tx) | Accuracy (no tx) | F1 (no tx) |
|---|---|---|---|---|---|---|---|---|
| Financial | 0.862 | 0.893 | 0.861 | 0.877 | 0.861 | 0.815 | 0.862 | 0.837 |
| Game | 0.866 | 0.879 | 0.883 | 0.872 | 0.815 | 0.826 | 0.837 | 0.820 |
| Lottery | 0.821 | 0.817 | 0.846 | 0.819 | 0.796 | 0.805 | 0.822 | 0.800 |
| Ethereum tool | 0.884 | 0.854 | 0.896 | 0.868 | 0.825 | 0.847 | 0.861 | 0.835 |
| Information management | 0.829 | 0.859 | 0.860 | 0.852 | 0.757 | 0.771 | 0.796 | 0.764 |
| Currency | 0.876 | 0.853 | 0.896 | 0.864 | 0.760 | 0.765 | 0.774 | 0.762 |
| Entertainment | 0.845 | 0.864 | 0.872 | 0.854 | 0.716 | 0.725 | 0.735 | 0.720 |
| IoT | 0.826 | 0.843 | 0.862 | 0.834 | 0.746 | 0.741 | 0.759 | 0.743 |
| Others | 0.784 | 0.819 | 0.825 | 0.801 | 0.745 | 0.737 | 0.763 | 0.740 |

Table 3 Support vector machine classification performance (w/ tx = with transaction information; no tx = without)

| Category | Precision (w/ tx) | Recall (w/ tx) | Accuracy (w/ tx) | F1 (w/ tx) | Precision (no tx) | Recall (no tx) | Accuracy (no tx) | F1 (no tx) |
|---|---|---|---|---|---|---|---|---|
| Financial | 0.875 | 0.897 | 0.906 | 0.885 | 0.815 | 0.831 | 0.842 | 0.822 |
| Game | 0.883 | 0.835 | 0.876 | 0.858 | 0.845 | 0.821 | 0.856 | 0.832 |
| Lottery | 0.879 | 0.846 | 0.887 | 0.862 | 0.855 | 0.793 | 0.814 | 0.822 |
| Ethereum tool | 0.861 | 0.865 | 0.891 | 0.862 | 0.829 | 0.827 | 0.836 | 0.827 |
| Information management | 0.804 | 0.863 | 0.877 | 0.832 | 0.764 | 0.786 | 0.789 | 0.774 |
| Currency | 0.872 | 0.862 | 0.889 | 0.866 | 0.787 | 0.792 | 0.803 | 0.789 |
| Entertainment | 0.863 | 0.859 | 0.873 | 0.860 | 0.708 | 0.714 | 0.726 | 0.710 |
| IoT | 0.829 | 0.845 | 0.867 | 0.836 | 0.756 | 0.758 | 0.763 | 0.756 |
| Others | 0.804 | 0.821 | 0.856 | 0.812 | 0.731 | 0.727 | 0.734 | 0.728 |
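The F1 scores reported in the tables are the harmonic mean of the per-class precision and recall. A quick consistency check against one row of the neural-network results (the financial category with transaction information) confirms this up to rounding:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)

# Financial category, with transaction information (neural network table):
# reported P = 0.943, R = 0.945, F1 = 0.943.
p, r, reported_f1 = 0.943, 0.945, 0.943
assert abs(f1_score(p, r) - reported_f1) < 0.002  # agrees up to rounding
```

The same check can be run on any other row; small discrepancies in the third decimal place are expected since the published values are themselves rounded.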