Short Text Classification Based on Probabilistic Semantic Distribution

MA Cheng-Long, YAN Yong-Hong

Citation: MA Cheng-Long, YAN Yong-Hong. Short Text Classification Based on Probabilistic Semantic Distribution. ACTA AUTOMATICA SINICA, 2016, 42(11): 1711-1717. doi: 10.16383/j.aas.2016.c150268

doi: 10.16383/j.aas.2016.c150268

Funds:

National Basic Research Program of China (973 Program) 2013CB329302

National High Technology Research and Development Program of China (863 Program) 2015AA016306

Key Science and Technology Project of the Xinjiang Uygur Autonomous Region 201230118-3

Strategic Priority Research Program of the Chinese Academy of Sciences XDA06030100, XDA06030500, XDA06040603

National Natural Science Foundation of China 11461141004, 61271426, 11504406, 11590770, 11590771, 11590772, 11590773, 11590774

Author Bio:

    YAN Yong-Hong  Professor at the Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences. He received his bachelor degree from Tsinghua University in 1990, and his Ph.D. degree in computer science and engineering from the Oregon Graduate Institute (OGI), USA, in August 1995. He worked at OGI as assistant professor (1995), associate professor (1998), and associate director (1997) of the Center for Spoken Language Understanding. His research interest covers speech processing and recognition, language/speaker recognition, and human-computer interfaces. E-mail: yanyonghong@hccl.ioa.ac.cn

Corresponding author:

    MA Cheng-Long  Ph.D. candidate at the Institute of Acoustics, Chinese Academy of Sciences. He received his bachelor degree in communication engineering from Shandong University, Weihai in 2011. His research interest covers natural language processing, spoken language understanding, sentiment analysis, and deep learning. Corresponding author of this paper. E-mail: machenglong@hccl.ioa.ac.cn
Abstract: In short text classification, the key challenge is to make full use of every word of a feature-sparse short text. This paper proposes a probabilistic semantic distribution model. First, texts are converted into word-vector data by looking words up in a word-embedding dictionary. Second, under the probabilistic semantic distribution assumption, a universal background semantic model is trained on unlabeled text data with a Gaussian mixture model (GMM); the universal model is then adapted with each domain's training data to obtain target-domain semantic distribution models. Finally, at test time, the probability of a short text under each domain model is computed to obtain the classification result. Experimental results show that the proposed method exploits, to a certain extent, the information provided by short texts and effectively reduces the dependence on training data, achieving a relative performance improvement of 17.7% over support vector machine (SVM) and maximum entropy classifiers.
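To make the pipeline in the abstract concrete, here is a minimal Python sketch, not the authors' implementation: it assumes pretrained word embeddings (e.g., word2vec vectors indexed by word) and uses scikit-learn's GaussianMixture. The universal background semantic model is a GMM over word vectors, and the domain-adaptation step is sketched as relevance-MAP adaptation of the means in the style of GMM-UBM speaker verification; all function names and the relevance factor are illustrative assumptions.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def embed(text, word_vectors):
        # Look up each in-vocabulary word; one row per word (unseen words are skipped).
        vecs = [word_vectors[w] for w in text.split() if w in word_vectors]
        return np.vstack(vecs) if vecs else None

    def train_ubm(unlabeled_vectors, n_components=64):
        # Universal background semantic model: a diagonal-covariance GMM trained
        # on word vectors pooled from large amounts of unlabeled text.
        ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              max_iter=100, random_state=0)
        ubm.fit(unlabeled_vectors)
        return ubm

    def map_adapt(ubm, domain_vectors, relevance=16.0):
        # Adapt only the UBM means to one domain's training word vectors
        # (relevance-MAP; relevance=16 is a common speaker-verification default,
        # an assumption here, not a value from the paper).
        resp = ubm.predict_proba(domain_vectors)           # (n_words, n_components)
        n_k = resp.sum(axis=0)                             # soft counts per component
        e_x = (resp.T @ domain_vectors) / np.maximum(n_k[:, None], 1e-10)
        alpha = (n_k / (n_k + relevance))[:, None]         # adaptation coefficients
        adapted = GaussianMixture(n_components=ubm.n_components,
                                  covariance_type="diag")
        adapted.weights_ = ubm.weights_
        adapted.covariances_ = ubm.covariances_
        adapted.means_ = alpha * e_x + (1.0 - alpha) * ubm.means_
        adapted.precisions_cholesky_ = 1.0 / np.sqrt(ubm.covariances_)
        return adapted

    def classify(text, word_vectors, domain_models):
        # Score a short text by the average per-word log-likelihood under each
        # adapted domain model; the highest-scoring domain is the label.
        x = embed(text, word_vectors)
        assert x is not None, "no in-vocabulary words"
        return max(domain_models, key=lambda d: domain_models[d].score(x))

Usage would be: train the UBM once on unlabeled data, call map_adapt with each domain's training vectors to build domain_models, then classify test texts. Averaging the log-likelihood (GaussianMixture.score) keeps scores comparable across texts of different lengths.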
Fig. 1  Short text classification based on a universal semantic background model

Fig. 2  Influence of background data and the number of Gaussian components

Fig. 3  Influence of training set size (1)

Fig. 4  Influence of training set size (2)

Table 1  Statistics of web snippets data

    No.  Domain               Training  Test
    1    Business             1200      300
    2    Computers            1200      300
    3    Culture & Arts       1880      330
    4    Education & Science  2360      300
    5    Engineering          220       150
    6    Health               880       300
    7    Politics & Society   1200      300
    8    Sports               1120      300
         Total                10060     2280

Table 2  Statistics of unseen words

                              Original words  Stems
    Training data             26 265          21 596
    Test data                 10 037          8 200
    Unseen words              4 378           3 677
    Proportion of unseen      43.62%          44.84%
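As a rough illustration of how the unseen-word statistics above could be produced (an assumption, not the authors' tooling), the sketch below builds the training and test vocabularies with and without Porter stemming (via NLTK) and reports the unseen fraction; whitespace tokenization is a simplification.

    from nltk.stem import PorterStemmer

    def vocab(texts, stem=False):
        # Unique surface words, optionally reduced to Porter stems.
        ps = PorterStemmer()
        words = {w for t in texts for w in t.split()}
        return {ps.stem(w) for w in words} if stem else words

    def unseen_stats(train_texts, test_texts, stem=False):
        # Unseen words: test-vocabulary entries absent from the training vocabulary.
        train_v = vocab(train_texts, stem)
        test_v = vocab(test_texts, stem)
        unseen = test_v - train_v
        return len(unseen), len(unseen) / len(test_v)

The proportions in Table 2 follow this definition: 4 378 / 10 037 ≈ 43.62% for original words, and 3 677 / 8 200 ≈ 44.84% after stemming.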

Table 3  Experimental results of the proposed method against other methods (%)

    Method                  Accuracy
    TF*IDF + SVM            66.14
    TF*IDF + MaxEnt         66.80
    LDA + MaxEnt            82.18
    Wiki feature + SVM      76.89
    Paragraph vector + SVM  61.90
    LSTM                    63.00
    Proposed method         80.00

Table 4  Evaluations of SVM, MaxEnt and the proposed method

                         SVM                     MaxEnt                  Proposed method
    Domain               P (%)  R (%)  F1       P (%)  R (%)  F1       P (%)  R (%)  F1
    Politics & Society   77.61  52.00  0.6228   70.75  50.00  0.5859   86.36  70.37  0.7755
    Computers            73.75  63.67  0.6834   72.26  66.00  0.6899   80.31  87.29  0.8365
    Education & Science  41.98  82.00  0.5553   45.93  82.67  0.5905   81.60  68.23  0.7432
    Sports               85.19  76.67  0.8070   86.08  78.33  0.8202   84.54  89.93  0.8715
    Health               89.01  56.67  0.6925   86.94  64.33  0.7395   76.35  85.57  0.8070
    Engineering          76.53  50.00  0.6048   72.84  39.33  0.5108   58.82  93.33  0.7216
    Business             70.37  57.00  0.6298   68.05  60.33  0.6396   73.99  67.33  0.7051
    Culture & Arts       62.27  81.52  0.7060   62.86  78.48  0.6981   88.15  77.85  0.8268
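For reference, the F1 values in Table 4 are consistent with the standard harmonic mean of precision and recall, F1 = 2PR / (P + R), with P and R taken as fractions. For example, for Politics & Society under SVM: F1 = 2 × 0.7761 × 0.5200 / (0.7761 + 0.5200) ≈ 0.6228, matching the table.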
Publication History

  • Received: 2015-05-19
  • Accepted: 2016-05-03
  • Published: 2016-11-01
