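Abstract (translated from the Chinese version):
Text is inherently ambiguous. When short texts have sparse features and lack context, existing methods cannot resolve their meaning, which leaves a wide semantic gap between low-level features and high-level interpretation. This paper uses factors such as time, space, and association to mine the implicit relationships among texts, reconstruct their context, and thereby improve sentiment polarity classification. The approach is a two-stage process: 1) short texts are first regrouped into contexts (domains) according to their intrinsic relationships; 2) each short text to be processed is assigned to a suitable context (domain) for further analysis. We first present a basic framework for short-text sentiment polarity classification based on a Naive Bayes classifier and show how differences between contexts (domains) affect classification performance. We then discuss an enhanced sentiment polarity classification method based on domain assignment, extend the notion of domain to context relationships, and propose a sentiment polarity detection method based on special context relationships. To address the difficulty of context reorganization caused by missing information, we also give a genetic-algorithm-based scheme for reorganizing arbitrary contexts. Theoretical analysis shows that, provided the constraints are satisfied, sentiment polarity detection based on context reconstruction can reduce both the sample error and the approximation error. Experimental results on real data sets confirm the conclusions of the theoretical analysis.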
Abstract:
Synonymy and polysemy pose a challenge to effective natural language processing, especially in short texts, where missing context and sparse features widen the semantic gap between low-level text feature representation and high-level interpretation. In this work, short texts are reorganized into special contexts according to implied internal relationships such as time and space, and a novel two-step scheme for semantic orientation detection based on these special contexts is proposed. In the first step, the short texts are reorganized into special contexts according to their implied internal relationships. In the second step, an unlabeled short text is assigned to a special context and given a polarity tag by that context's inner semantic orientation classifier. We first present a sentiment classification framework based on a naive Bayes classifier and discuss the effect of special context on its performance. We then give an enhanced classification method based on the concept of field (domain), which is extended to special context. Finally, a method for reorganizing special contexts based on a genetic algorithm is proposed. Theoretical analysis shows that, under certain constraints, the proposed methods can reduce both the sample error and the approximation error. Experimental results on real corpora confirm the effectiveness of the proposed method.
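
The two-step scheme can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the implementation evaluated in this work: the class names `NaiveBayes` and `TwoStepSentimentClassifier` are hypothetical, the four training texts are invented for illustration, and the context labels are supplied by hand, whereas the proposed method derives contexts from implied relationships (time, space) or by genetic-algorithm reorganization, neither of which is shown here. The sketch only shows why routing a short text to a context before polarity classification can help when the same word carries opposite polarity in different contexts.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for w in tokens:
                if w in self.vocab:  # skip words never seen in training
                    score += math.log((counts[w] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


class TwoStepSentimentClassifier:
    """Step 1: route a text to a context. Step 2: classify polarity there."""

    def fit(self, docs, contexts, polarities):
        # Context router (assumption: context labels are given by hand).
        self.router = NaiveBayes().fit(docs, contexts)
        # One polarity classifier per context, trained on that context only.
        grouped = defaultdict(lambda: ([], []))
        for tokens, ctx, pol in zip(docs, contexts, polarities):
            grouped[ctx][0].append(tokens)
            grouped[ctx][1].append(pol)
        self.per_context = {
            ctx: NaiveBayes().fit(d, p) for ctx, (d, p) in grouped.items()
        }
        return self

    def predict(self, tokens):
        ctx = self.router.predict(tokens)
        return ctx, self.per_context[ctx].predict(tokens)


# Invented toy data: "long" is positive for phones, negative for restaurants.
docs = [["battery", "long"], ["battery", "short"],
        ["wait", "long"], ["wait", "short"]]
contexts = ["phone", "phone", "food", "food"]
polarities = ["pos", "neg", "neg", "pos"]

model = TwoStepSentimentClassifier().fit(docs, contexts, polarities)
print(model.predict(["battery", "long"]))  # ('phone', 'pos')
print(model.predict(["wait", "long"]))     # ('food', 'neg')
```

In this toy corpus the word "long" occurs once with each polarity, so a single classifier trained on all four texts has no evidence from that word alone; the per-context classifiers separate the two uses.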