Clustering Boundary Pattern Discovery for High Dimensional Space Based on Matrix Model
Abstract: Manifold learning seeks a suitable embedding that maps a high-dimensional space to a low-dimensional one. However, the embedded subspace may still have a high dimension, which makes data mining tasks in high-dimensional space hard to solve. This paper builds a simple matrix model that judges whether the k-nearest-neighbor space of a data point is symmetric about that point, and uses the resulting symmetry rate to extract boundary points. On this basis it proposes a clustering boundary detection technique for high-dimensional space, clustering boundary detection based on matrix model (MMC). The model is simple to construct, direct, and easy to understand and use. Theoretical analysis and experiments on synthetic and real data sets show that MMC can effectively detect clustering boundaries in both low- and high-dimensional spaces.
Key words:
- High dimensional space
- clustering boundary
- matrix model
- k nearest neighbors
- symmetry rate
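The abstract states the core idea only at a high level: check whether the k nearest neighbors of a point are distributed symmetrically around it, and treat points with a low symmetry rate as boundary points. The following is a minimal sketch of that idea, not the paper's actual matrix model. The sign matrix, the per-dimension balance score, the `threshold` parameter, and the function names `symmetry_rates` and `detect_boundary` are all assumptions introduced for illustration.

```python
import math

def symmetry_rates(points, k):
    """Return one symmetry rate in [0, 1] per point (1.0 = balanced neighborhood).

    Illustrative assumption: symmetry is measured per dimension as the
    balance between neighbors lying on either side of the point.
    """
    rates = []
    for i, p in enumerate(points):
        # Brute-force k nearest neighbors of p, excluding p itself.
        others = [q for j, q in enumerate(points) if j != i]
        neighbors = sorted(others, key=lambda q: math.dist(p, q))[:k]
        # Sign matrix: one row per neighbor, one column per dimension,
        # with entries in {-1, 0, +1} for "below", "equal", "above" p.
        signs = [[(qd > pd) - (qd < pd) for pd, qd in zip(p, q)]
                 for q in neighbors]
        per_dim = []
        for column in zip(*signs):
            pos = sum(1 for s in column if s > 0)
            neg = sum(1 for s in column if s < 0)
            larger = max(pos, neg)
            # No neighbor differs in this dimension: treat it as balanced.
            per_dim.append(min(pos, neg) / larger if larger else 1.0)
        # The least balanced dimension decides the rate of the point.
        rates.append(min(per_dim))
    return rates

def detect_boundary(points, k, threshold):
    """Return indices of points whose symmetry rate is below threshold."""
    rates = symmetry_rates(points, k)
    return [i for i, rate in enumerate(rates) if rate < threshold]

# Usage on a 5 x 5 grid: the outer ring has no neighbors on one side.
grid = [(x, y) for x in range(5) for y in range(5)]
boundary = [grid[i] for i in detect_boundary(grid, k=8, threshold=0.5)]
print(len(boundary))  # 16 outer-ring points
```

On this toy grid the nine inner points each have eight evenly placed neighbors and score 1.0, while every outer-ring point scores 0. The brute-force neighbor search is quadratic in the number of points and is only meant to keep the sketch self-contained.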
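Table 1 Pretreatment methods

| Dataset | Number of samples | Dimensions | Pretreatment method |
|---|---|---|---|
| Mnist | 10 000 | 28 | 3) |
| Colon | 62 | 2 000 | 1) |
| Prostate | 102 | 10 509 | 2) |
| Pointing data | 2 790 | 384 | 3) |

Table 2 The clustering boundary detection results of different algorithms on different data sets

| Dataset | Dimensions | Algorithm | True boundary points | Detected boundary points | Correctly detected boundary points | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|---|---|
| DS1 | 2 | BAND | 640 | 823 | 556 | 0.6756 | 0.8688 | 0.7601 |
| DS1 | 2 | BORDER | 640 | 723 | 540 | 0.7469 | 0.8438 | 0.7924 |
| DS1 | 2 | BRINK | 640 | 667 | 520 | 0.7795 | 0.8125 | 0.7957 |
| DS1 | 2 | BRIM | 640 | 680 | 536 | 0.7882 | 0.8375 | 0.8121 |
| DS1 | 2 | BERGE | 640 | 662 | 532 | 0.8036 | 0.8313 | 0.8172 |
| DS1 | 2 | MMC | 640 | 630 | 576 | 0.9143 | 0.9000 | 0.9071 |
| DS2 | 2 | BAND | 538 | 749 | 454 | 0.6061 | 0.8439 | 0.7055 |
| DS2 | 2 | BORDER | 538 | 669 | 445 | 0.6366 | 0.8271 | 0.7195 |
| DS2 | 2 | BRINK | 538 | 499 | 438 | 0.8778 | 0.8141 | 0.8447 |
| DS2 | 2 | BRIM | 538 | 562 | 466 | 0.8292 | 0.8661 | 0.8472 |
| DS2 | 2 | BERGE | 538 | 553 | 472 | 0.8535 | 0.8773 | 0.8652 |
| DS2 | 2 | MMC | 538 | 549 | 503 | 0.9162 | 0.9349 | 0.9255 |
| DS3 | 2 | BAND | 1 077 | 1 629 | 961 | 0.5899 | 0.8923 | 0.7103 |
| DS3 | 2 | BORDER | 1 077 | 1 252 | 831 | 0.6637 | 0.7716 | 0.7136 |
| DS3 | 2 | BRINK | 1 077 | 1 540 | 914 | 0.5935 | 0.8478 | 0.6985 |
| DS3 | 2 | BRIM | 1 077 | 1 188 | 935 | 0.7870 | 0.8682 | 0.8256 |
| DS3 | 2 | BERGE | 1 077 | 1 138 | 942 | 0.8278 | 0.8747 | 0.8506 |
| DS3 | 2 | MMC | 1 077 | 1 016 | 968 | 0.9528 | 0.8988 | 0.9250 |
| DS4 | 2 | BAND | 1 204 | 1 944 | 1 056 | 0.5432 | 0.8771 | 0.6709 |
| DS4 | 2 | BORDER | 1 204 | 1 802 | 1 089 | 0.6043 | 0.9045 | 0.7246 |
| DS4 | 2 | BRINK | 1 204 | 1 817 | 1 003 | 0.5520 | 0.8331 | 0.6640 |
| DS4 | 2 | BRIM | 1 204 | 1 355 | 1 062 | 0.7838 | 0.8821 | 0.8300 |
| DS4 | 2 | BERGE | 1 204 | 1 246 | 1 123 | 0.9013 | 0.9327 | 0.9167 |
| DS4 | 2 | MMC | 1 204 | 1 228 | 1 138 | 0.9267 | 0.9452 | 0.9359 |
| Biomed | 4 | BAND | 30 | 26 | 22 | 0.8462 | 0.7333 | 0.7857 |
| Biomed | 4 | BORDER | 30 | 26 | 23 | 0.8846 | 0.7667 | 0.8214 |
| Biomed | 4 | BRINK | 30 | 36 | 30 | 0.8333 | 1.0000 | 0.9089 |
| Biomed | 4 | BERGE | 30 | 26 | 24 | 0.9231 | 0.8000 | 0.8572 |
| Biomed | 4 | MMC | 30 | 30 | 28 | 0.9333 | 0.9333 | 0.9333 |
| Cancer | 10 | BAND | 37 | 37 | 25 | 0.6757 | 0.6757 | 0.6757 |
| Cancer | 10 | BORDER | 37 | 37 | 28 | 0.7568 | 0.7568 | 0.7568 |
| Cancer | 10 | BRINK | 37 | 37 | 29 | 0.7837 | 0.7837 | 0.7837 |
| Cancer | 10 | BERGE | 37 | 37 | 28 | 0.7568 | 0.7568 | 0.7568 |
| Cancer | 10 | MMC | 37 | 38 | 34 | 0.8947 | 0.9189 | 0.9067 |
| Colon | 2 000 | BAND | 7 | 6 | 5 | 0.8333 | 0.7143 | 0.7692 |
| Colon | 2 000 | BORDER | 7 | 7 | 7 | 1.0000 | 1.0000 | 1.0000 |
| Colon | 2 000 | BRINK | 7 | 6 | 5 | 0.8333 | 0.7143 | 0.7692 |
| Colon | 2 000 | BERGE | 7 | 6 | 5 | 0.8333 | 0.7143 | 0.7692 |
| Colon | 2 000 | MMC | 7 | 7 | 7 | 1.0000 | 1.0000 | 1.0000 |
| Prostate | 10 509 | BAND | 18 | 17 | 16 | 0.9412 | 0.8889 | 0.9143 |
| Prostate | 10 509 | BORDER | 18 | 19 | 18 | 0.9474 | 1.0000 | 0.9730 |
| Prostate | 10 509 | BRINK | 18 | 17 | 16 | 0.9412 | 0.8889 | 0.9143 |
| Prostate | 10 509 | BERGE | 18 | 17 | 16 | 0.9412 | 0.8889 | 0.9143 |
| Prostate | 10 509 | MMC | 18 | 18 | 18 | 1.0000 | 1.0000 | 1.0000 |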
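The three score columns of Table 2 can be recomputed from the three count columns. The sketch below assumes the standard definitions (precision = correct / detected, recall = correct / true, F-measure = harmonic mean of the two), which reproduce the MMC row for DS1. The function name `boundary_scores` is hypothetical and does not come from the paper.

```python
def boundary_scores(true_count, detected_count, correct_count):
    """Precision, recall and F-measure from boundary-point counts.

    Assumes standard definitions; all three counts must be positive.
    """
    precision = correct_count / detected_count
    recall = correct_count / true_count
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# MMC on DS1 in Table 2: 640 true, 630 detected, 576 correct.
p, r, f = boundary_scores(640, 630, 576)
print(f"{p:.4f} {r:.4f} {f:.4f}")  # 0.9143 0.9000 0.9071
```

Running the same check over other rows is a quick way to spot transcription errors in the counts or scores.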
