Data Stream Clustering Algorithm Based on Density and Affinity Propagation Techniques
Abstract: Existing data stream clustering algorithms suffer from low clustering accuracy, poor handling of outliers, and an inability to detect changes in the data stream in real time. To address these shortcomings, a data stream clustering algorithm that combines density with affinity propagation is proposed. The algorithm adopts an online/offline two-stage processing framework and introduces a micro-cluster decay density to accurately reflect how the data stream evolves. Micro-clusters are dynamically maintained and pruned online, which keeps the model consistent with the intrinsic characteristics of the original data stream. When a new class pattern is detected in the model, an improved weighted and hierarchical affinity propagation (WAP) algorithm is used to rebuild the model, so the algorithm can detect changes in the data stream in real time and report clustering results at any time. Experiments on real and synthetic data sets show that the algorithm has good applicability, effectiveness, and scalability, and achieves good clustering results.
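The online phase described in the abstract (decayed micro-cluster density with dynamic maintenance and pruning) can be illustrated with a minimal sketch. The abstract does not give the decay function or thresholds, so the exponential fading factor `2 ** (-lam * dt)`, the class `MicroCluster`, the function `online_step`, and the parameters `lam`, `radius`, and `min_weight` are all assumptions made for illustration, not the authors' definitions. The offline WAP reconstruction step is not shown.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Hypothetical micro-cluster summary: decayed weight and decayed linear sum."""
    center_sum: list
    weight: float
    last_update: float

    def decay_to(self, now, lam):
        # Assumed decay form: statistics fade by 2^(-lam * elapsed time).
        factor = 2.0 ** (-lam * (now - self.last_update))
        self.weight *= factor
        self.center_sum = [s * factor for s in self.center_sum]
        self.last_update = now

    def absorb(self, point, now, lam):
        # Bring the summary up to date, then add the new point with weight 1.
        self.decay_to(now, lam)
        self.weight += 1.0
        self.center_sum = [s + x for s, x in zip(self.center_sum, point)]

    @property
    def center(self):
        return [s / self.weight for s in self.center_sum]


def online_step(clusters, point, now, lam=0.25, radius=1.0, min_weight=0.1):
    """Insert one point, then prune micro-clusters whose decayed weight is too low."""
    for mc in clusters:
        mc.decay_to(now, lam)
    nearest = min(clusters, key=lambda mc: math.dist(mc.center, point), default=None)
    if nearest is not None and math.dist(nearest.center, point) <= radius:
        nearest.absorb(point, now, lam)
    else:
        # No micro-cluster is close enough: start a new one at this point.
        clusters.append(MicroCluster(list(point), 1.0, now))
    # Drop micro-clusters that have faded below the (assumed) weight threshold.
    clusters[:] = [mc for mc in clusters if mc.weight >= min_weight]
    return clusters


clusters = []
online_step(clusters, (0.0, 0.0), now=0.0)
online_step(clusters, (0.5, 0.0), now=0.0)    # merged into the first micro-cluster
online_step(clusters, (10.0, 10.0), now=4.0)  # too far away: new micro-cluster
online_step(clusters, (10.0, 10.0), now=24.0) # the old micro-cluster fades and is pruned
```

In this sketch a stale micro-cluster disappears once its decayed weight falls below `min_weight`, which is one simple way to keep the model aligned with recent data; the paper's actual maintenance and deletion rules may differ.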