Abstract: Joint sentiment-topic models have been applied successfully to the analysis of online reviews. However, as smart mobile devices with small screens and limited input become widespread, user reviews are getting shorter and shorter, and we have to deal with the text sparsity problem in short reviews. This paper proposes SSTM (short-text sentiment-topic model), a joint sentiment-topic model designed for short texts, to address this sparsity. Unlike conventional topic models, which model the generative process of each document, SSTM directly models the generation of the whole review corpus: at each step of the generative process a word pair is sampled, and the two words of a pair share the same sentiment polarity and topic. We apply SSTM to two real-world online review datasets and evaluate it on three tasks. Qualitative analysis confirms the effectiveness of SSTM on topic discovery, and in a quantitative comparison with classic methods SSTM also achieves substantially better document-level sentiment classification performance.
Key words:
- Sentiment classification /
- sentiment topic model /
- topic model /
- short text topic model /
- text sparsity
Table 1 Meanings of the notations
Symbol | Description
D | number of documents
M | number of word pairs
T | number of topics
S | number of sentiment polarities
V | vocabulary size
b | word pair, b = (w_i, w_j)
w | word
z | topic
l | sentiment polarity label
π_{k,l} | distribution on topic k and sentiment polarity l
Π | multinomial distribution of sentiment polarity labels
φ_{k,l,w} | distribution of word w given topic k and sentiment polarity l
Φ | multinomial distribution of words
θ_k | distribution of topic k
α | Dirichlet prior of θ
β | asymmetric Dirichlet prior of φ, β = {β_{k,l,i}}, k = 1, …, T, l = 1, …, S, i = 1, …, V
γ | Dirichlet prior of π
Θ | multinomial distribution of topics
z_t | topic of the t-th word
l_t | sentiment polarity label of the t-th word
B | set of word pairs
{z_{-t}} | topic assignments of all words except the t-th word
{l_{-t}} | sentiment polarity labels of all words except the t-th word
N_{k,l,i} | number of times word w_i is assigned to topic k and sentiment polarity l
N_{k,l} | number of words assigned to topic k and sentiment polarity l
N'(·) | sentence count
N_k | number of words in topic k
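The abstract describes the corpus-level generative process only informally: word pairs are drawn one at a time, and both words in a pair share a topic and a sentiment polarity. The Python sketch below makes that story concrete using the notation of Table 1 (θ, π_{k,l}, φ_{k,l,w}, b = (w_i, w_j)). It is a minimal illustration under assumed symmetric priors; the helper names, the windowing in extract_biterms, and the hyperparameter values are assumptions made here, not the paper's implementation, and the asymmetric prior β and the inference procedure are defined only in the full text.

```python
# Illustrative sketch only: SSTM-style corpus-level generation of word pairs.
# Assumptions (not from the paper): symmetric Dirichlet priors in place of the
# asymmetric prior beta, the helper names, and the use of a whole short review
# as a single co-occurrence window.
import itertools

import numpy as np


def extract_biterms(doc_tokens):
    """Collect unordered word pairs b = (w_i, w_j) from one short review,
    treating the whole review as a single co-occurrence window."""
    return [tuple(sorted(p)) for p in itertools.combinations(doc_tokens, 2)]


def generate_corpus(M, T, S, V, alpha=0.1, gamma=1.0, beta=0.01, seed=0):
    """Draw M word pairs from a corpus-level sentiment-topic model:
    theta ~ Dir(alpha)        corpus-wide distribution over T topics
    pi_k  ~ Dir(gamma)        sentiment distribution under each topic k
    phi_{k,l} ~ Dir(beta)     word distribution for each (topic, sentiment)
    for each word pair: z ~ theta, l ~ pi_z, then w_i, w_j ~ phi_{z,l}."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(T, alpha))
    pi = rng.dirichlet(np.full(S, gamma), size=T)        # shape (T, S)
    phi = rng.dirichlet(np.full(V, beta), size=(T, S))   # shape (T, S, V)
    corpus = []
    for _ in range(M):
        z = rng.choice(T, p=theta)       # topic shared by both words of the pair
        l = rng.choice(S, p=pi[z])       # sentiment polarity shared by the pair
        w_i, w_j = rng.choice(V, size=2, p=phi[z, l])
        corpus.append(((int(w_i), int(w_j)), int(z), int(l)))
    return corpus


if __name__ == "__main__":
    print(extract_biterms(["屏幕", "漂亮", "电池", "耐用"]))
    print(generate_corpus(M=5, T=3, S=2, V=10))
```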
Table 2 Statistics of the text corpus
Statistic | Laptop | Mobile
Average number of words per document | 20 | 32
Number of reviews | 3,988 | 2,289
Vocabulary size | 7,964 | 8,787
Number of positive reviews | 1,993 | 1,146
Number of negative reviews | 1,995 | 1,943
Table 3 Example topics discovered from LAPTOP dataset
Method | Topic | Top words
SSTM | 外观 (appearance) | 指纹 钢琴 漂亮 烤漆 好 模具 屏幕 外壳 文字 呵呵
SSTM | 电池 (battery) | 电池 小时 长 比较 时间 续航 使用 上网 小 芯
SSTM | 散热性 (heat dissipation) | 散热 热 温度 好 烫 CPU 硬盘 风扇 机器 比较
BTM | 外观 (appearance) | 太 容易 指纹 键盘 烤漆 比较 不错 外壳 钢琴 屏幕
BTM | 电池 (battery) | 电池 时间 小时 键盘 比较 长 好 不错 使用 续航
BTM | 散热性 (heat dissipation) | 散热 好 不错 电池 度 热 温度 声音 使用 CPU
LDA | 外观 (appearance) | 容易 指纹 外壳 钢琴 烤漆 表面 亮点 感觉 说 屏幕
LDA | 电池 (battery) | 电池 小时 时间 长 续航 比较 使用 键盘 小巧 芯
LDA | 散热性 (heat dissipation) | 好 散热 声音 风扇 小 温度 热 运行 轻 时
Table 4 Example topics discovered from MOBILE dataset
Method | Topic | Top words
SSTM | 拍照 (camera) | 拍摄 功能 支持 屏幕 像素 材质 照片 摄像头 拍照 数码
SSTM | 媒体播放 (media playback) | 播放 速度 不错 影音 手机 处理器 格式 MP3 流畅 文件
SSTM | 屏幕 (screen) | 屏幕 好 显示 色 效果 彩色 设计 TFT 机子 人
BTM | 拍照 (camera) | 像素 摄像头 拍摄 数码 手机 支持 倍 效果 相机 拍照
BTM | 媒体播放 (media playback) | MP3 播放 耳机 效果 好 音乐 听 功能 不错 比较
BTM | 屏幕 (screen) | 屏幕 色 显示 TFT 效果 色彩 手机 好 26万 像素
LDA | 拍照 (camera) | 效果 摄像头 像素 拍照 照片 拍摄 拍 数码 相机 倍
LDA | 媒体播放 (media playback) | 支持 MP3 播放 内存 蓝牙 卡 格式 扩展 文件 视频
LDA | 屏幕 (screen) | 屏幕 显示 比较 色彩 色 清晰 高 铃声 方便 TFT
Table 5 CM(%) on laptop dataset
Method | Annotator 1 | Annotator 2 | Annotator 3 | Annotator 4 | Average
LDA | 58 | 50 | 60 | 56 | 56
BTM | 70 | 66 | 75 | 72 | 70.75
SSTM | 69 | 64 | 72 | 67 | 68
Table 6 CM(%) on mobile dataset
Method | Annotator 1 | Annotator 2 | Annotator 3 | Annotator 4 | Average
LDA | 69 | 65 | 71 | 74 | 69.75
BTM | 76 | 74 | 81 | 81 | 78
SSTM | 75 | 72 | 79 | 78 | 76
Table 7 Example sentiment-specific topics discovered by SSTM
Dataset | Polarity | Topic | Top words
Laptop | Positive | 快递 (delivery) | 速度 东西 京东 质量 好 发货 问题 比较 很快 送货
Laptop | Positive | 性价比 (value for money) | 不错 价格 机器 便宜 款 性能 好 电脑 超值 降价
Laptop | Positive | 外观 (appearance) | 小 漂亮 喜欢 买 外观 本本 不错 好 键盘 适合
Laptop | Negative | 做工 (build quality) | 有点 禁用 触摸板 需要 外壳 盖子 小 老版 掉 瑕疵
Laptop | Negative | 售后 (after-sales service) | 电话 服务 差 客服 送货 快递 货 无 态度 前台
Mobile | Positive | 铃声 (ringtone) | 铃声 不错 耳机 听 声音 放 音乐 好 耳朵 效果
Mobile | Positive | 外观 (appearance) | 设计 外观 不错 好 感觉 喜欢 漂亮 时尚 手感 机身
Mobile | Positive | 按键 (keys) | 按键 手感 感觉 好 操作 不错 容易 使用 摇杆 舒服
Mobile | Negative | 输入法 (input method) | 短信 输入法 键 切换 拼音 数字 麻烦 选 手 标点符号
Mobile | Negative | 信号 (signal) | 信号 网络 差 无 检测 移动 关机 质量 故障 通话
Table 8 Sentiment identification results (The number of topics is 25.)
Dataset | Baseline | JST | ASUM | SSTM | SVM (Uni) | SVM (Bi)
Laptop | 0.637645 | 0.50677 | 0.57754 | 0.65503 | 0.66047 | 0.70021
Mobile | 0.602188 | 0.53698 | 0.43694 | 0.64201 | 0.64476 | 0.68953
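Table 8 reports document-level sentiment classification accuracy, but this excerpt does not state how a trained model is turned into a polarity label for each review. The sketch below shows one plausible decision rule, aggregating word-level sentiment evidence over a review and taking the argmax, together with the accuracy computation that underlies results such as those in Table 8. The decision rule, helper names, and toy parameters are assumptions for illustration, not the evaluation procedure used in the paper.

```python
# Illustrative sketch only: one plausible way to obtain document-level
# sentiment predictions like those scored in Table 8 from learned parameters.
# theta (T,), pi (T, S) and phi (T, S, V) follow the notation of Table 1;
# the decision rule itself is an assumption, not taken from the paper.
import numpy as np


def doc_sentiment(doc_word_ids, theta, pi, phi):
    """Score sentiment l by sum_w log( sum_k theta_k * pi_{k,l} * phi_{k,l,w} )
    over the words of one review and return the argmax label."""
    kl_weight = theta[:, None] * pi     # joint weight of (topic k, sentiment l), shape (T, S)
    scores = np.zeros(pi.shape[1])
    for w in doc_word_ids:
        scores += np.log((kl_weight * phi[:, :, w]).sum(axis=0) + 1e-12)
    return int(np.argmax(scores))


def accuracy(pred_labels, gold_labels):
    """Fraction of reviews whose predicted polarity matches the gold label."""
    pred, gold = np.asarray(pred_labels), np.asarray(gold_labels)
    return float((pred == gold).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, S, V = 3, 2, 10
    theta = rng.dirichlet(np.full(T, 0.1))
    pi = rng.dirichlet(np.full(S, 1.0), size=T)
    phi = rng.dirichlet(np.full(V, 0.01), size=(T, S))
    docs = [[1, 4, 4, 7], [2, 3, 9]]                # toy reviews as word-id lists
    preds = [doc_sentiment(d, theta, pi, phi) for d in docs]
    print(preds, accuracy(preds, [0, 1]))
```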