Abstract: Building on text segmentation, this work identifies the topic of each segment and then generalizes the main topic of the whole text, so that the topical structure of the text becomes visible. Topics are represented as word clusters. For accurate analysis, latent Dirichlet allocation (LDA) is used to model both the corpus and the text; Clarity serves as the measure of similarity between blocks, and segment boundaries are identified at its local minima. Topic words for each segment are extracted according to their Shannon information. Through clustering of background vocabulary and association of topic words, the topic words are extended beyond the text under analysis, in an attempt to uncover meaning that lies beneath the surface of the words. Experiments show that the analysis results are clearly better than those of other methods, and that the approach can provide valuable preprocessing for subsequent text reasoning.
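The boundary-detection and topic-word steps summarized in the abstract can be illustrated with a minimal sketch. It rests on several assumptions not taken from the paper: each block is assumed to already have a topic distribution (for example, inferred by an LDA model trained elsewhere), the inter-block similarity is a Jensen-Shannon-based stand-in and not the Clarity measure the authors use, and all function names are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in bits for two discrete distributions of equal length."""
    return sum(pi * math.log2((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def js_similarity(p, q):
    """Stand-in similarity: 1 minus the Jensen-Shannon divergence (in bits).

    Assumption: this replaces the paper's Clarity measure for illustration only.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 1.0 - 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))

def local_minima(scores):
    """Indices of interior points strictly lower than both neighbours."""
    return [i for i in range(1, len(scores) - 1)
            if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]]

def segment_boundaries(block_topics):
    """Return (boundary gaps, similarities); gap i lies between block i and i + 1."""
    sims = [js_similarity(a, b) for a, b in zip(block_topics, block_topics[1:])]
    return local_minima(sims), sims

def shannon_information(probability):
    """Self-information of a word in bits: rarer words carry more information."""
    return -math.log2(probability)

# Hypothetical per-block topic distributions over two topics.
blocks = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.15, 0.85]]
boundaries, sims = segment_boundaries(blocks)
print(boundaries)  # -> [1]: the similarity dips between block 1 and block 2
```

In this toy input the first two blocks and the last two blocks share a dominant topic, so the only dip in similarity falls at the gap between them. A word's Shannon information could then be used to rank candidate topic words within each resulting segment, although the exact ranking rule used by the authors is not given in the abstract.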