摘要:
基于图的算法已经成为半监督学习中的一种流行方法, 该方法把数据定义为图的节点, 用图的边表示数据之间的关系, 在各种数据分布情况下都具有很高的分类准确度. 然而图方法的计算复杂度比较高, 当图的规模比较大时, 计算所需要的时间和存储都非常大, 这在一定程度上限制了图方法的使用. 因此, 如何控制图的大小是基于图的半监督学习算法中的一个重要问题. 本文提出了一种基于密度估计的快速聚类方法, 可以在局部范围对数据点进行聚类, 以聚类形成的子集作为构图的节点, 从而大大降低了图的复杂度. 新的聚类方法计算量较小, 通过推导得到的距离函数能较好地保持原有数据分布. 实验结果表明, 通过局部聚类后构建的小图在分类效果上与在原图上的结果相当, 同时在计算速度上有极大的提高.
Abstract:
Graph-based semi-supervised methods define a graph where the nodes are labeled and unlabeled examples in the dataset, and edges reflect the similarity of examples. These methods usually assume label smoothness over the graph. Graph methods are nonparametric, discriminative, and transductive in nature. These methods take high classification accuracy on variant data distributions. But the computation complexity is very high. As the size of dataset grows, the graph will be too large to compute and this limits the extension of its usage. In this paper, we propose a novel method for fast computation based on local clustering, which is very efficient for reduction of graph size and can maintain the accuracy at the same time. The local clustering method is of low computation complexity and the data structure can be preserved by a newly designed distance function. Experimental results show that this approach can preserve the accuracy of purely graph-based methods and significantly reduce computational cost.