Topic Analysis Based on LDA Model

SHI Jing (石晶); FAN Meng (范猛); LI Wanlong (李万龙)

doi:10.3724/SP.J.1004.2009.01586

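1. College of Computer Science and Engineering, Changchun University of Technology, Changchun 130012
2. Department of Science and Research Administration, Changchun University of Technology, Changchun 130012
3. College of Computer Science and Technology, Jilin University, Changchun 130012

Corresponding author: SHI Jing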

CLC number: TP301
Publication history
- Received: 2008-07-16
- Revised: 2009-03-25
- Published: 2009-12-20


Abstract: Building on text segmentation, the topic of each segment is identified and the central topic of the whole text is then summarized, so that the topical structure of the text is revealed. Topics are represented as word strings. To make the analysis accurate, LDA (latent Dirichlet allocation) is used to model the corpus and the text, Clarity is used as the measure of similarity between blocks, and segment boundaries are identified at local minima. Topic words of each segment are extracted according to the Shannon information of words. Through clustering of background vocabulary and association of topic words, the topic words are extended beyond the text under analysis, in an attempt to uncover the meaning that lies beneath the surface of the words. Experiments show that the analysis results are clearly better than those of other methods, and that the approach can provide valuable preprocessing for subsequent text reasoning.

Keywords:
- Topic analysis
- latent Dirichlet allocation (LDA) model
- text segmentation
- Gibbs sampling
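
The boundary-detection step described in the abstract can be illustrated with a minimal sketch. It assumes each block already has a topic distribution (for example, one inferred by LDA), scores adjacent blocks for similarity, and places a boundary wherever the score is a local minimum. The abstract does not define Clarity, so cosine similarity is used here purely as a placeholder. The block distributions and the function names `cosine`, `local_minima`, and `shannon_information` are hypothetical. The last function only computes the standard self-information of a probability; how the paper uses it to rank topic words is not stated in the abstract.

```python
import math


def cosine(p, q):
    """Cosine similarity between two vectors (placeholder for Clarity)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def local_minima(scores):
    """Indices whose score is strictly lower than both neighbours."""
    return [
        i
        for i in range(1, len(scores) - 1)
        if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]
    ]


def shannon_information(prob):
    """Self-information, in bits, of an event with probability `prob`."""
    return -math.log2(prob)


# Hypothetical topic distributions for four consecutive text blocks.
blocks = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
]

# sims[i] is the similarity between block i and block i + 1.
sims = [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]

# A local minimum at index i suggests a boundary between blocks i and i + 1.
boundaries = local_minima(sims)
print(boundaries)
```

With these illustrative distributions, the only dip in similarity falls between the second and third blocks, where the dominant topic changes. A rarer word carries more Shannon information than a common one, for instance `shannon_information(0.25)` is 2 bits against 1 bit for `shannon_information(0.5)`.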