A New Cross-multidomain Classification Algorithm and Its Fast Version for Large Datasets
Abstract: Cross-domain learning and classification aims to transfer what is learned by supervised training on multiple source domains to a target domain, so that unlabeled target data can be classified. Current cross-domain learning methods mostly address transfer from a single source domain to the target domain and generally work with small samples. Their domain adaptability is limited, and both classification accuracy and running speed degrade on large multi-source datasets. To exploit as much useful data from related domains as possible, this paper proposes a multiple sources cross-domain classification (MSCC) algorithm. Based on the logistic regression model and a proposed consensus measure, MSCC builds a classifier for each source domain and uses the classifiers jointly to guide classification in the target domain. To make efficient use of large source-domain samples, MSCC is combined with CDdual (dual coordinate descent method), a recent advance in large-scale logistic regression, to derive a fast version, MSCC-CDdual, which is also analysed theoretically. Experimental results on artificial, text and image datasets indicate that MSCC-CDdual achieves high classification accuracy, fast running speed and good domain adaptation on large multi-source datasets.
The contributions of this work are threefold: 1) a novel consensus measure for multi-source cross-domain classification is proposed, which is suitable for boosting multiple classifiers and makes it convenient to develop MSCC into its fast version MSCC-CDdual; 2) the proposed MSCC-CDdual algorithm is shown to be suitable for multi-source cross-domain learning on both small and large datasets; 3) MSCC-CDdual shows an additional advantage over other algorithms on high-dimensional datasets.
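To make the multi-source idea concrete, the following is a minimal sketch, not the authors' MSCC or MSCC-CDdual. It assumes synthetic two-dimensional Gaussian data, fits one L2-regularized logistic regression per source domain by plain gradient descent (not dual coordinate descent), and combines the source classifiers on the unlabeled target domain by simply averaging their posterior probabilities. The averaging rule is a stand-in for the paper's consensus measure, which the abstract does not define; all function names, shifts and hyperparameters here are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(X):
    # Append a constant column so the last weight acts as the intercept.
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_logistic(X, y, l2=1e-2, lr=0.2, n_iter=500):
    """L2-regularized logistic regression fitted by batch gradient descent.
    X has shape (n, d); y holds labels in {0, 1}."""
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)
        w -= lr * (Xb.T @ (p - y) / len(y) + l2 * w)
    return w

def posterior(w, X):
    # Estimated P(y = 1 | x) under the fitted model.
    return sigmoid(add_bias(X) @ w)

def make_domain(rng, n, shift):
    """Two Gaussian classes in 2-D; `shift` translates the whole domain
    (an assumed, simplistic model of domain difference)."""
    y = rng.integers(0, 2, size=n)
    centers = np.where(y[:, None] == 1, 1.5, -1.5)
    X = centers + rng.normal(size=(n, 2)) + np.asarray(shift)
    return X, y.astype(float)

rng = np.random.default_rng(0)

# Three labeled source domains, each slightly shifted from the target.
source_shifts = [(0.3, -0.2), (-0.4, 0.1), (0.2, 0.4)]
sources = [make_domain(rng, 300, s) for s in source_shifts]

# Target domain: labels are generated only to score the result.
X_target, y_target = make_domain(rng, 400, (0.0, 0.0))

# One classifier per source domain.
weights = [fit_logistic(X, y) for X, y in sources]

# Posteriors of every source classifier on the target samples: (3, 400).
P = np.vstack([posterior(w, X_target) for w in weights])

# Stand-in combination rule: average the posteriors, then threshold.
p_avg = P.mean(axis=0)
y_pred = (p_avg > 0.5).astype(float)
acc = float((y_pred == y_target).mean())
print("target accuracy with averaged posteriors:", acc)
```

Averaging is chosen only because it is the simplest way to let several source classifiers vote on target samples. A consensus-based method would additionally reward agreement among the source classifiers on the target data during training.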
Key words:
- cross-domain
- multi-source
- logistic regression
- posterior probability
- classification