Clustering Boundary Pattern Discovery for High Dimensional Space Base on Matrix Model
-
摘要: 流形学习关注于寻找合适的嵌入方式将高维空间映射至低维空间,但映射子空间依然可能具有较高的维度,难以解决高维空间的数据挖掘任务.本文建立一种简单的矩阵模型判断数据点k近邻空间关于该点的对称性,并使用对称率进行边界提取,提出一种基于矩阵模型的高维聚类边界检测技术(Clustering boundary detection based on matrix model,MMC).该模型构造简单、直接、易于理解和使用.理论分析以及在人工合成和真实数据集的实验结果表明MMC算法能够有效地检测出低维和高维空间的聚类边界.Abstract: Manifold learning aims to find a reasonable embed mode to map a high-dimensional space to a low dimensional space. However, the dimension of the latter may still be so high that any data mining task cannot be effectively finished. This paper proposes a simple matrix model to judge the symmetry of data object and its k nearest neighbors space, and use the symmetry rate to extract the clustering boundary. Finally, the MMC algorithm is developed. Theoretical analysis and experimental results show that the MMC can effectively detect the clustering boundary of low and high dimensional spaces.
-
Key words:
- High dimensional space /
- clustering boundary /
- martin model /
- k nearest neighbors /
- symmetry rate
1) 本文责任编委 张军平 -
表 1 预处理方式
Table 1 Pretreatment methods
数据集 样本总数 维数 预处理方式 Mnist 10 000 28 3) Colon 62 2 000 1) Prostate 102 10 509 2) Pointing data 2 790 384 3) 表 2 不同算法在不同数据集上聚类边界检测结果
Table 2 The clustering boundary detection results of different algorithms on different data sets
数据集 维度 算法 真实边界数 检测边界数 检测正确边界数 准确率 召回率 F-measure DS1 2 BAND 640 823 556 0.6756 0.8688 0.7601 BORDER 723 540 0.7469 0.8438 0.7924 BRINK 667 520 0.7795 0.8125 0.7957 BRIM 680 536 0.7882 0.8375 0.8121 BERGE 662 532 0.8036 0.8313 0.8172 MMC 630 576 0.9143 0.9000 0.9071 DS2 2 BAND 538 749 454 0.6061 0.8439 0.7055 BORDER 669 445 0.6366 0.8271 0.7195 BRINK 499 438 0.8778 0.8141 0.8447 BRIM 562 466 0.8292 0.8661 0.8472 BERGE 553 472 0.8535 0.8773 0.8652 MMC 549 503 0.9162 0.9349 0.9255 DS3 2 BAND 1 077 1 629 961 0.5899 0.8923 0.7103 BORDER 1 252 831 0.6637 0.7716 0.7136 BRINK 1 540 914 0.5935 0.8478 0.6985 BRIM 1 188 935 0.7870 0.8682 0.8256 BERGE 1 138 942 0.8278 0.8747 0.8506 MMC 1 016 968 0.9528 0.8988 0.9250 DS4 2 BAND 1 204 1 944 1 056 0.5432 0.8771 0.6709 BORDER 1 802 1 089 0.6043 0.9045 0.7246 BRINK 1 817 1 003 0.5520 0.8331 0.6640 BRIM 1 355 1 062 0.7838 0.8821 0.8300 BERGE 1 246 1 123 0.9013 0.9327 0.9167 MMC 1 228 1 138 0.9267 0.9452 0.9359 Biomed 4 BAND 30 26 22 0.8462 0.7333 0.7857 BORDER 26 23 0.8846 0.7667 0.8214 BRINK 36 30 0.8333 1.0000 0.9089 BERGE 26 24 0.9231 0.8000 0.8572 MMC 30 28 0.9333 0.9333 0.9333 Cancer 10 BAND 37 37 25 0.6757 0.6757 0.6757 BORDER 37 28 0.7568 0.7568 0.7568 BRINK 37 29 0.7837 0.7837 0.7837 BERGE 37 28 0.7568 0.7568 0.7568 MMC 38 34 0.8947 0.9189 0.9067 Colon 2 000 BAND 7 6 5 0.8333 0.7143 0.7692 BORDER 7 7 1.0000 1.0000 1.0000 BRINK 6 5 0.8333 0.7143 0.7692 BERGE 6 5 0.8333 0.7143 0.7692 MMC 7 7 1.0000 1.0000 1.0000 Prostate 10 509 BAND 18 17 16 0.9412 0.8889 0.9143 BORDER 19 18 0.9474 1.0000 0.9730 BRINK 17 16 0.9412 0.8889 0.9143 BERGE 17 16 0.9412 0.8889 0.9143 MMC 18 18 1.0000 1.0000 1.0000 -
[1] Tsapanos N, Tefas A, Nikolaidis N, Pitas I. A distributed framework for trimmed kernel k-means clustering. Pattern Recognition, 2015, 48(8):2685-2698 doi: 10.1016/j.patcog.2015.02.020 [2] Guo Y H, Sengur A. NCM:neutrosophic c-means clustering algorithm. Pattern Recognition, 2015, 48(8):2710-2724 doi: 10.1016/j.patcog.2015.02.018 [3] Vikjord V V, Jenssen R. Information theoretic clustering using a k-nearest neighbors approach. Pattern Recognition, 2014, 47(9):3070-3081 doi: 10.1016/j.patcog.2014.03.018 [4] Jain A K. Data clustering:50 years beyond k-means. Pattern Recognition Letters, 2010, 31(8):651-666 doi: 10.1016/j.patrec.2009.09.011 [5] Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data. Data Mining and Knowledge Discovery, 2015, 11(1):5-33 doi: 10.1007-s10618-005-1396-1/ [6] Dai L Z, Ding J D, Yang J. Inhomogeneity-embedded active contour for natural image segmentation. Pattern Recognition, 2015, 48(8):2513-2529 doi: 10.1016/j.patcog.2015.03.001 [7] Aja-Fernández S, Curiale A H, Vegas-Sánchez-Ferrero G. A local fuzzy thresholding methodology for multiregion image segmentation. Knowledge-Based Systems, 2015, 83(1):1-12 [8] Peng P, Addam O, Elzohbi M, Özyer S T, Elhajj A, Gao S, Liu Y M, Özyer T, Kaya M, Ridley M, Rokne J, Alhajj R. Reporting and analyzing alternative clustering solutions by employing multi-objective genetic algorithm and conducting experiments on cancer data. Knowledge-Based Systems, 2014, 56(3):108-122 [9] Kaur P, Soni A K, Gosain A. RETRACTED:a robust kernelized intuitionistic fuzzy c-means clustering algorithm in segmentation of noisy medical images. Pattern Recognition Letters, 2013, 34(2):163-175 doi: 10.1016/j.patrec.2012.09.015 [10] Parsons L, Haque E, Liu H. Subspace clustering for high dimensional data:a review. ACM SIGKDD Explorations Newsletter, 2004, 6(1):90-105 doi: 10.1145/1007730 [11] Angiulli F, Pizzuti C. Outlier mining in large high-dimensional data sets. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(2):203-215 doi: 10.1109/TKDE.2005.31 [12] Ester M, Kriegel H P, Sander J, Xu X W. A density-based algorithm for discovering clusters in large spatial databases with noise. In:Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96). Portland, Oregon:Association for the Advancement of Artificial Intelligence, 1996. 226-231 [13] Xia C Y, Hsu W, Lee M L, Ooi B C. BORDER:efficient computation of boundary points. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(3):289-303 doi: 10.1109/TKDE.2006.38 [14] Achtert E, Böhm C, Kröger P, Kunath P, Pryakhin A, Renz M. Efficient reverse k-nearest neighbor estimation. Informatik-Forschung und Entwicklung, 2007, 21(3-4):179-195 doi: 10.1007/s00450-007-0027-z [15] Qiu B Z, Yue F, Shen J Y. BRIM:an efficient boundary points detecting algorithm. Advances in Knowledge Discovery and Data Mining. Berlin Heidelberg:Springer, 2007. 761-768 [16] 薛丽香, 邱保志.基于变异系数的边界点检测算法.模式识别与人工智能, 2009, 22(5):799-802 http://d.wanfangdata.com.cn/Periodical/mssbyrgzn200905020Xue Li-Xiang, Qiu Bao-Zhi. Boundary points detection algorithm based on coefficient of variation. Pattern Recognition and Artificial Intelligence, 2009, 22(5):799-802 http://d.wanfangdata.com.cn/Periodical/mssbyrgzn200905020 [17] 邱保志, 杨洋, 杜效伟. BRINK:基于局部质变因子的聚类边界检测算法.郑州大学学报(工学版), 2012, 33(3):117-120Qiu Bao-Zhi, Yang Yang, Du Xiao-Wei. BRINK:an algorithm of boundary points of clusters detection based on local qualitative factors. Journal of Zhengzhou University (Engineering Science), 2012, 33(3):117-120 [18] 李向丽, 耿鹏, 邱保志.混合属性数据集的聚类边界检测技术.控制与决策, 2015, 30(1):171-175 http://d.wanfangdata.com.cn/Periodical/kzyjc201501030Li Xiang-Li, Geng Peng, Qiu Bao-Zhi. Clustering boundary detection technology for mixed attribute data set. Control and Decision, 2015, 30(1):171-175 http://d.wanfangdata.com.cn/Periodical/kzyjc201501030 [19] Roweis S T, Saul L K. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000, 290(5500):2323-2326 doi: 10.1126/science.290.5500.2323 [20] Belkin M, Niyogi P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In:Advances in Neural Information Processing Systems 14:Proceedings of the 2001 Conference. Cambridge MA:MIT Press, 2001. 585-591 [21] He X, Niyogi P. Locality preserving projections. Advances in Neural Information Processing Systems, 2003, 16(1):186-197 http://d.wanfangdata.com.cn/Periodical/rjxb201006008 [22] Zhang Z Y, Zha H Y. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal on Scientific Computing, 2004, 26(1):313-338 doi: 10.1137/S1064827502419154 [23] 郑思龙, 李元祥, 魏宪, 彭希帅.基于字典学习的非线性降维方法.自动化学报, 2016, 42(7):1065-1076 http://www.aas.net.cn/CN/abstract/abstract18897.shtmlZheng Si-Long, Li Yuan-Xiang, Wei Xian, Peng Xi-Shuai. Nonlinear dimensionality reduction based on dictionary learning. Acta Automatica Sinica, 2016, 42(7):1065-1076 http://www.aas.net.cn/CN/abstract/abstract18897.shtml [24] Taşdemir K, Yalçin B, Yildirim I. Approximate spectral clustering with utilized similarity information using geodesic based hybrid distance measures. Pattern Recognition, 2015, 48(4):1465-1477 doi: 10.1016/j.patcog.2014.10.023 [25] Hearst M A, Dumais S T, Osman E, Platt J, Scholkopf B. Support vector machines. IEEE Intelligent Systems and their Applications, 1998, 13(4):18-28 doi: 10.1109/5254.708428 [26] Chua L O, Yang L. Cellular neural networks:theory. IEEE Transactions on Circuits and Systems, 1998, 35(10):1257-1272 http://d.wanfangdata.com.cn/Periodical/dlxtzdh201405011 [27] Zimdahl H, Hübner N. Gene chip technology and its application to molecular medicine. Encyclopedic Reference of Genomics and Proteomics in Molecular Medicine. Berlin Heidelberg:Springer, 2006. 650-655 [28] Ferdous M M, Vinciotti V, Liu X H, Wilson P. Exploring the link between gene expression and protein binding by integrating mRNA microarray and ChIP-Seq data. Statistical Learning and Data Sciences. Switzerland:Springer International Publishing, 2015. 214-222 [29] The Data and Story Library. Biomed data set[Online], available:http://lib.stat.cmu.edu/datasets/biomed.data.html, October 24, 2017 [30] UCI Machine Learning Repository. Cancer data set[Online], available:http://archive.ics.uci.edu/ml/datasets.html, October 24, 2017 [31] Princeton University Gene Expression Project. Colon data set[Online], available:http://genomics-pubs.princeton.edu/oncology/affydata/index.html, October 24, 2017 [32] Gene Expression Model Selector. Prostate data set[Online], available:http://www.gems-system.org, October 24, 2017 [33] Zhu Q, Xin H. Feature extraction and filter in handwritten numeral recognition. Geo-Informatics in Resource Management and Sustainable Ecosystem. Berlin Heidelberg:Springer, 2013. 58-67 [34] Weber-Alonso J M, Sesmero M P, Gutierrez G, Ledezma A, Sanchis A. Input transformation and output combination for improved handwritten digit recognition. Artificial Neural Networks. Switzerland:Springer International Publishing, 2015, 4:435-443 [35] Wang Y M, Peyls A, Pan Y, Claesen L, Yan X L. A fast self-organizing map algorithm for handwritten digit recognition. Multimedia and Ubiquitous Engineering. Berlin Heidelberg:Springer, 2013, 240:177-183 [36] Jia W J, Zhang H F, He X J. Region-based license plate detection. Journal of Network and Computer Applications, 2007, 30(4):1324-1333 doi: 10.1016/j.jnca.2006.09.010 [37] Zhou W G, Li H Q, Lu Y G, Tian Q. Principal visual word discovery for automatic license plate detection. IEEE Transactions on Image Processing, 2012, 21(9):4269-4279 doi: 10.1109/TIP.2012.2199506 [38] THE MNIST DATABASE. Mnist data set[Online], available:http://yann.SMCun.com/exdb/mnist/, October 24, 2017 [39] Huang S C, Chen J, Luo Z. RETRACTED ARTICLE:sparse tensor CCA for color face recognition. Neural Computing and Applications, 2014, 24(7-8):1647-1658 doi: 10.1007/s00521-013-1387-x [40] Bhaskar B, Mahantesh K, Geetha G P. An investigation of fSVD and ridgelet transform for illumination and expression invariant face recognition advances in intelligent informatics. Advances in Intelligent Informatics. Switzerland:Springer International Publishing, 2015, 320:31-38 [41] Dang K D, Le T H. Local region partitioning for disguised face recognition using non-negative sparse coding. Advanced Methods for Computational Collective Intelligence. Berlin Heidelberg:Springer, 2013, 457:197-206 [42] Head Pose Image Database. Pointing'04 dat set, [Online], available:http://www-prima.inrialpes.fr/Pointing04, October 24, 2017