A New Cross-multidomain Classification Algorithm and Its Fast Version for Large Datasets

GU Xin^{1,2}, WANG Shi-Tong^{1}, XU Min^{1,3}

doi: 10.3724/SP.J.1004.2014.00531
cstr: 32138.14.SP.J.1004.2014.00531

1. School of Digital Media, Jiangnan University, Wuxi 214122
2. Jiangsu North Huguang Opto-Electronics Co. Ltd., Wuxi 214035
3. Wuxi Institute of Technology, Wuxi 214000

Funds:

Supported by National Natural Science Foundation of China (60903100, 60975027)

Details

Author biography:
WANG Shi-Tong Professor, senior member of the China Computer Federation. Main research interests include artificial intelligence, pattern recognition, data mining, neural networks, fuzzy systems, medical image processing, and bioinformatics. E-mail: wxwangst@yahoo.com.cn

Corresponding author:
GU Xin

Metrics
- Article views: 2132
- HTML full-text views: 99
- PDF downloads: 1442
- Citations: 0

Publication history
- Received: 2012-06-25
- Revised: 2013-02-04
- Published: 2014-03-20

Abstract

Abstract: Cross-domain learning and classification aims to transfer what has been learned under supervision from multiple source domains to a target domain, so that the unlabeled data of the target domain can be classified. Current cross-domain learning methods mostly address transfer from a single source domain to the target domain, usually with small sample sizes. Such methods adapt poorly across domains and struggle with large datasets, which directly degrades both classification accuracy and running speed. To use as much of the useful data from related domains as possible, this paper proposes a multi-source cross-domain classification (MSCC) algorithm. Based on the logistic regression model and a proposed consensus measure, MSCC builds a classifier for each source domain and combines these classifiers to guide classification in the target domain. To make efficient use of large source-domain samples, MSCC is combined with CDdual (dual coordinate descent method), a recent advance in large-scale logistic regression, to derive a fast version, MSCC-CDdual, which is also analysed theoretically. Experimental results on artificial, text, and image datasets indicate that the proposed algorithm achieves high classification accuracy, fast running speed, and good domain adaptation on large datasets. The contributions of this work are threefold: 1) a novel consensus measure for multi-source cross-domain classification, which is suitable for boosting multiple classifiers and makes it convenient to develop MSCC into the fast version MSCC-CDdual; 2) the fast algorithm MSCC-CDdual, which is suitable for both small and large datasets; 3) MSCC-CDdual shows a distinct advantage over other algorithms on high-dimensional datasets.

Key words:
- cross-domain
- multi-source
- logistic regression
- posterior probability
- classification
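The multi-source idea in the abstract (one logistic regression classifier per source domain, combined to label the target domain) can be illustrated with a minimal Python sketch. Several things here are assumptions rather than the paper's method: plain averaging of posterior probabilities stands in for the proposed consensus measure, which the abstract does not specify; batch gradient descent stands in for CDdual; and the names `train_logreg`, `posterior`, `predict_target`, `make_domain`, and the 2-D toy data are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, lr=0.1, epochs=200, l2=1e-3):
    """Fit L2-regularised logistic regression by batch gradient descent.

    X is a list of feature lists, y a list of 0/1 labels.
    Returns (weights, bias). Gradient descent is an assumption, not CDdual.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def posterior(model, x):
    """P(y = 1 | x) under one source-domain classifier."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def predict_target(models, X_target):
    """Average the source posteriors and threshold at 0.5 (assumed rule)."""
    labels = []
    for x in X_target:
        p = sum(posterior(m, x) for m in models) / len(models)
        labels.append(1 if p >= 0.5 else 0)
    return labels


def make_domain(rng, shift, n=100):
    """Hypothetical toy domain: two Gaussian classes, shifted along x."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        cx = (2.0 if label else -2.0) + shift
        cy = 2.0 if label else -2.0
        X.append([rng.gauss(cx, 0.7), rng.gauss(cy, 0.7)])
        y.append(label)
    return X, y


rng = random.Random(0)
# Three labeled source domains, each with a slightly different shift.
sources = [make_domain(rng, shift) for shift in (-0.5, 0.0, 0.5)]
models = [train_logreg(X, y) for X, y in sources]

# Target labels are used only to score the predictions, never for training.
X_target, y_target = make_domain(rng, 0.25)
pred = predict_target(models, X_target)
accuracy = sum(p == t for p, t in zip(pred, y_target)) / len(y_target)
print(f"target accuracy: {accuracy:.2f}")
```

Averaging treats every source domain as equally relevant to the target. A consensus-based scheme such as the one the abstract describes would presumably weight or regularise the source classifiers according to how well they agree on the target data, and a dual coordinate descent solver would replace the inner training loop for large samples.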


