Abstract: When classifying imbalanced data, traditional fuzzy systems recognize minority-class samples at a low rate. To address this problem, first, for antecedent parameter learning, a Bayesian fuzzy clustering algorithm based on competitive learning (BFCCL) is proposed. BFCCL takes into account the repulsion between the cluster centers of different classes, runs in an alternating iterative manner, and obtains the optimal model parameters by the Markov chain Monte Carlo method. Second, for consequent parameter learning, a large-margin strategy is adopted and the parameters are tuned so that the distance from the minority class to the classification boundary is larger than that from the majority class, which effectively corrects the skew of the classification boundary. Based on these ideas, and taking the zero-order TSK fuzzy system as the concrete object of study, a zero-order TSK fuzzy system for imbalanced data classification (0-TSK-IDC) is constructed. Experimental results on synthetic and real-world medical datasets show that 0-TSK-IDC achieves high recognition rates on both the minority and the majority classes in imbalanced data classification, with good robustness and interpretability.
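For orientation, the following is a minimal sketch of how a zero-order TSK fuzzy system produces a class decision once its rules are fixed: each rule has a Gaussian antecedent centered at a prototype (e.g. a cluster center) and a constant consequent, and the output is the normalized, firing-strength-weighted sum of the consequents. The prototypes, widths and consequents below are illustrative assumptions, not the result of the 0-TSK-IDC learning procedure.

```python
# Minimal zero-order TSK inference sketch (illustrative values, not 0-TSK-IDC's
# learned parameters): Gaussian antecedents centered at prototypes, constant
# consequents, class decided by the sign of the fuzzy-system output.
import numpy as np

def firing_strengths(X, centers, widths):
    """Normalized rule firing strengths; X: (n, d), centers/widths: (k, d)."""
    diff = X[:, None, :] - centers[None, :, :]             # (n, k, d)
    mu = np.exp(-0.5 * (diff / widths[None, :, :]) ** 2)   # per-dimension memberships
    w = mu.prod(axis=2)                                    # product T-norm -> (n, k)
    return w / (w.sum(axis=1, keepdims=True) + 1e-12)

def tsk0_output(X, centers, widths, consequents):
    """Zero-order TSK output: firing-strength-weighted sum of constants p_k."""
    return firing_strengths(X, centers, widths) @ consequents

# Toy usage: two rules, one per class; labels are the sign of the output.
X = np.array([[0.1, 0.2], [2.0, 1.8]])
centers = np.array([[0.0, 0.0], [2.0, 2.0]])   # e.g. cluster prototypes
widths = np.ones_like(centers)
consequents = np.array([-1.0, 1.0])
print(np.sign(tsk0_output(X, centers, widths, consequents)))   # [-1.  1.]
```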
Keywords:
- Imbalanced data
- Classification
- Markov chain Monte Carlo
- Takagi-Sugeno-Kang fuzzy system
Abstract: When learning from imbalanced datasets, traditional fuzzy systems have a low recognition rate on the minority class. To address this problem, firstly, in the antecedent parameter learning stage, a new clustering method, called Bayesian fuzzy clustering based on competitive learning (BFCCL), is proposed to partition the input space for the antecedents of the if-then rules. BFCCL takes into account the repulsive force between the cluster prototypes of different classes, and uses an alternating iterative strategy in which the optimal model parameters are obtained by the Markov chain Monte Carlo method. Secondly, in the consequent parameter learning stage, a large-margin strategy is adopted and the parameters are tuned so that the distance between the minority class and the classification hyperplane is larger than the distance between the majority class and the hyperplane, which effectively corrects the skew of the classification hyperplane. Based on these ideas, a zero-order Takagi-Sugeno-Kang fuzzy system for imbalanced data classification (0-TSK-IDC) is proposed. Experimental results on synthetic and real-world medical datasets demonstrate that 0-TSK-IDC achieves high recognition rates on both the minority and the majority classes in imbalanced data classification, together with good robustness and interpretability.
Associate Editor in charge of this paper: WANG Li-Wei.
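The skew correction described above — keeping minority-class samples farther from the decision boundary than majority-class samples — can be illustrated with an asymmetric-margin hinge objective. This is a sketch under my own assumptions (a plain linear score and a subgradient solver), not the consequent-learning optimization used in 0-TSK-IDC; the margin parameters rho_min > rho_maj are the knobs that push the boundary toward the majority class.

```python
# Illustrative only: a hinge loss whose margin requirement is larger for the
# minority class, so the learned hyperplane is pushed away from the minority
# class. The linear model and subgradient solver are assumptions for
# exposition, not the paper's consequent-learning procedure.
import numpy as np

def fit_asymmetric_margin(X, y, rho_min=2.0, rho_maj=1.0, lam=0.01,
                          lr=0.01, epochs=500, minority_label=1):
    """Minimize lam*||w||^2 + mean(max(0, rho_i - y_i*(w.x_i + b))), y in {-1,+1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    rho = np.where(y == minority_label, rho_min, rho_maj)   # per-sample margin
    for _ in range(epochs):
        scores = X @ w + b
        viol = rho - y * scores > 0                          # active hinge terms
        grad_w = 2 * lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
        grad_b = -y[viol].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage on a 1:9 imbalanced two-Gaussian problem.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 0.5, (10, 2)),    # minority (+1)
               rng.normal(-1.5, 0.5, (90, 2))])   # majority (-1)
y = np.array([1] * 10 + [-1] * 90)
w, b = fit_asymmetric_margin(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```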
Table 1 The basic information of the datasets

Dataset           No. of positive samples   No. of negative samples   Positive:negative ratio                     No. of attributes
Banana            600, 200, 100             1500                      2:5, 2:15, 1:15                             2
Heart statlog     120, 60, 30, 20, 12       170                       12:17, 6:17, 3:17, 2:17, 6:85               13
Breast wisconsin  241, 200, 150, 100, 40    458                       241:458, 100:229, 75:229, 50:229, 20:229    10
Liver disorders   145, 100, 50, 20          200                       29:40, 1:2, 1:4, 1:10                       7
Haberman          81, 40, 25                225                       9:25, 8:45, 1:9                             3

Table 2 Comparison of G-mean, F-measure and their standard deviations (in parentheses) for 0-TSK-IDC using the BFC and BFCCL clustering results of Figs. 4-7 on the Banana dataset

No. of rules  Positive:negative ratio  BFC G-mean (%)  BFC F-measure (%)  BFCCL G-mean (%)  BFCCL F-measure (%)
6             2:5                      96.48(0.60)     96.2(0.64)         96.97(0.53)       96.44(0.54)
6             2:15                     95.89(0.52)     95.77(0.49)        96.20(0.31)       96.22(0.36)
6             1:15                     94.14(0.47)     93.79(0.41)        95.45(0.55)       94.99(0.48)
8             2:5                      97.98(0.31)     97.23(0.34)        99.75(0.27)       99.74(0.23)
8             2:15                     97.03(0.29)     96.92(0.29)        99.32(0.34)       99.32(0.35)
8             1:15                     96.76(0.36)     96.65(0.32)        98.68(0.30)       98.63(0.33)

Table 3 Comparison of G-mean, F-measure and their standard deviations (in parentheses) for 0-TSK-IDC with rule antecedents obtained by BFC and by BFCCL on the UCI medical datasets

Dataset   Positive:negative ratio  BFC G-mean (%)  BFC F-measure (%)  BFCCL G-mean (%)  BFCCL F-measure (%)
Heart     12:17                    87.01(1.69)     86.87(1.78)        89.56(1.91)       89.36(1.90)
Heart     6:17                     86.24(2.00)     85.48(1.88)        88.14(1.89)       87.88(1.92)
Heart     3:17                     85.41(1.97)     84.08(2.00)        87.29(2.01)       86.40(2.00)
Heart     2:17                     82.50(2.30)     80.02(2.31)        85.71(2.24)       83.63(2.23)
Heart     6:85                     81.05(2.17)     75.37(2.19)        84.65(2.11)       78.50(2.04)
Breast    241:458                  93.62(2.60)     91.46(2.57)        96.56(2.34)       95.03(2.20)
Breast    100:229                  91.14(2.05)     90.02(2.55)        95.59(1.97)       94.24(2.01)
Breast    75:229                   90.37(2.00)     89.14(2.04)        93.75(1.88)       91.22(1.89)
Breast    50:229                   87.99(1.90)     85.73(1.89)        91.59(2.03)       89.29(2.10)
Breast    20:229                   84.21(2.23)     81.26(2.21)        87.56(2.00)       86.05(1.99)
Liver     29:40                    70.38(0.80)     66.28(0.82)        72.50(0.77)       68.51(0.76)
Liver     1:2                      69.77(0.75)     61.27(0.75)        71.15(0.69)       62.50(0.60)
Liver     1:4                      67.82(0.79)     52.98(0.78)        70.24(0.73)       55.22(0.79)
Liver     1:10                     65.08(0.81)     47.31(0.83)        67.18(0.75)       50.65(0.72)
Haberman  9:25                     76.05(1.73)     52.75(1.73)        76.56(1.60)       53.61(1.60)
Haberman  8:45                     68.02(1.86)     51.09(1.85)        68.97(1.85)       52.60(1.87)
Haberman  1:9                      64.21(1.69)     48.22(1.73)        65.42(1.74)       50.01(1.69)

Table 4 Comparison of G-mean, F-measure and their standard deviations (in parentheses) for the 0-TSK-IDC fuzzy classifier and other algorithms on the Banana dataset

Positive:negative ratio  Algorithm      G-mean (%)   F-measure (%)
2:5                      FS-FCSVM       95.90(0.89)  95.31(0.84)
2:5                      L2-TSK-FS      96.26(0.47)  95.48(0.45)
2:5                      BFCCL-TSK-FS   96.79(0.50)  96.14(0.41)
2:5                      Adaboost       98.77(0.87)  98.71(0.90)
2:5                      CS-SVM         98.98(0.33)  98.91(0.30)
2:5                      0-TSK-IDC      99.75(0.27)  99.74(0.23)
2:15                     FS-FCSVM       90.53(0.65)  89.46(0.60)
2:15                     L2-TSK-FS      89.23(0.71)  88.47(0.72)
2:15                     BFCCL-TSK-FS   92.70(0.58)  92.26(0.62)
2:15                     Adaboost       97.92(0.64)  97.75(0.68)
2:15                     CS-SVM         98.22(0.37)  98.05(0.36)
2:15                     0-TSK-IDC      99.32(0.34)  99.32(0.35)
1:15                     FS-FCSVM       86.06(0.81)  82.83(0.84)
1:15                     L2-TSK-FS      87.95(0.55)  84.67(0.54)
1:15                     BFCCL-TSK-FS   88.84(0.43)  86.33(0.49)
1:15                     Adaboost       97.46(0.58)  97.28(0.52)
1:15                     CS-SVM         97.79(0.74)  97.61(0.70)
1:15                     0-TSK-IDC      98.68(0.30)  98.63(0.33)
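For reference, the G-mean and F-measure values reported in the tables above follow the standard definitions for binary imbalanced classification, with the minority class taken as the positive class; a minimal sketch of the computation from a confusion matrix:

```python
# Standard G-mean and F-measure for binary classification, with the minority
# class treated as the positive class (as in the tables above).
import numpy as np

def gmean_fmeasure(y_true, y_pred, positive=1):
    """Return (G-mean, F-measure) computed from the binary confusion matrix."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # recall on the minority class
    specificity = tn / (tn + fp) if tn + fp else 0.0   # recall on the majority class
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = np.sqrt(sensitivity * specificity)
    f_measure = (2 * precision * sensitivity / (precision + sensitivity)
                 if precision + sensitivity else 0.0)
    return g_mean, f_measure

# Toy usage: 2 minority and 4 majority samples.
print(gmean_fmeasure([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))   # (~0.612, 0.5)
```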