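Abstract
Effective and stable feature extraction and representation are important factors in improving the performance of sentiment analysis of online reviews. Building on conventional features such as continuous bag-of-words and trigger pairs, this paper studies how to extract and represent regular collocation features in online reviews. It proposes two extraction strategies, one combining mutual information with average mutual information and one based on rough sets. Starting from an analysis of the effectiveness and stability of the extraction methods, it examines the internal and external stability of the extracted regular collocations, and it incorporates the filtered regular collocation features into several sentiment analysis models. Experiments on real hotel review data show that appropriate representation and filtering of regular collocation features effectively improve the classification precision of sentiment analysis models. The study also finds that, when sentiment feature words are unevenly distributed within reviews, an extraction strategy based on variable precision rough rules helps improve classification precision.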
Precise sentiment orientation classification models and the extraction of effective and stable features from the review context are two essential factors that affect the performance of online review sentiment analysis. Among the various complicated features arising from language complexity, regular collocation features are found to play an important role because of their structured expressions, and they show great impact on sentiment orientation beyond conventional bag-of-words and trigger pair features. To extract such complicated features for online review sentiment analysis, two novel approaches are presented in this paper to capture regular collocation features effectively from the review corpora: a combination of mutual information and average mutual information, and a rough set based method. The extracted regular collocation features are incorporated into sentiment analysis models as inputs for review sentiment analysis. Experiments on real online hotel reviews achieve generally higher precision, improving the performance of SVM models by 0.34% and that of Naïve Bayes models by 1.27%. For the extraction of regular collocation features, two aspects were considered essential to expressing effectively the complicated constraints on review sentiment orientation: (1) the internal stability of the regular collocation structure, which accounts for the substantial existence of the regular collocation apart from traditional word bags or trigger pairs, and (2) the external effectiveness of the regular collocations, which accounts for their contribution to sentiment orientation classification. The mutual information method used in this paper measures external effectiveness, while the average mutual information computation and its filtering measure the internal stability of regular collocations. The rough set based method ensures internal stability and external effectiveness through an α approximation rough rule extraction strategy and a maximum likelihood estimate of the regular collocation distribution. In implementation, the approach presented has to contend with the non-uniform distribution of sentiment features within a review, so variable precision strategies on the rough sets approach were introduced in place of the original rough rule strategy. The experiments found that the variable precision strategies on the rough sets approach achieved the best sentiment analysis performance, 88.38% with SVM models at a threshold value of 0.85. These results show that, in dealing with online reviews with a non-uniform distribution of sentiment features, the variable precision strategy avoids the true voice of the minority and helps discriminate the overall sentiment orientation of the review. When dealing with online reviews with a uniform distribution of sentiment features, α approximation would be a better choice to replace the original maximum likelihood estimate in the pursuit of better sentiment analysis. A combination of mutual information and average mutual information would also be an optional strategy for achieving comparable performance with less computation under the same conditions.
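The abstract does not give the formulas behind the two measures or the variable precision rule, so the following is only a minimal sketch under stated assumptions. It uses the textbook pointwise mutual information between a collocation and a sentiment class (a stand-in for external effectiveness), the textbook average mutual information between the presence of two words (a stand-in for internal stability), and a simple inclusion-degree test as a stand-in for a variable precision rough rule. All function names and the toy counts are hypothetical, and the default threshold 0.85 is borrowed from the abstract purely as an illustrative value.

```python
import math


def pointwise_mutual_information(n_fc, n_f, n_c, n):
    """log2 P(f, c) / (P(f) P(c)), estimated from review counts.

    n_fc: reviews containing feature f AND labelled with class c
    n_f:  reviews containing feature f
    n_c:  reviews labelled with class c
    n:    total number of reviews
    """
    if n_fc == 0:
        return float("-inf")
    return math.log2((n_fc * n) / (n_f * n_c))


def average_mutual_information(n_xy, n_x, n_y, n):
    """Average MI between the binary events 'word x present' and 'word y present'.

    Sums P(a, b) * log2(P(a, b) / (P(a) P(b))) over the four presence/absence cells.
    """
    cells = [
        (n_xy, n_x, n_y),                          # x present, y present
        (n_x - n_xy, n_x, n - n_y),                # x present, y absent
        (n_y - n_xy, n - n_x, n_y),                # x absent,  y present
        (n - n_x - n_y + n_xy, n - n_x, n - n_y),  # x absent,  y absent
    ]
    total = 0.0
    for joint, margin_a, margin_b in cells:
        if joint > 0:  # empty cells contribute nothing
            total += (joint / n) * math.log2((joint * n) / (margin_a * margin_b))
    return total


def passes_variable_precision(n_fc, n_f, beta=0.85):
    """Accept the rule 'feature f implies class c' when at least a fraction
    beta of the reviews containing f carry class c (assumed inclusion test)."""
    if n_f == 0:
        return False
    return n_fc / n_f >= beta


if __name__ == "__main__":
    # Hypothetical counts: 100 reviews, 50 positive, a collocation seen in 10
    # reviews, 9 of which are positive.
    print(pointwise_mutual_information(9, 10, 50, 100))
    print(average_mutual_information(50, 50, 50, 100))
    print(passes_variable_precision(9, 10))
```

In this sketch a candidate collocation would be kept only if its word pair scores high on average mutual information and the collocation itself scores high against a sentiment class. Whether the paper filters in this order, or with these exact estimators, is not stated in the abstract.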
Source
《管理工程学报》
CSSCI
Peking University Core Journal
2014, No. 4, pp. 180-186 (7 pages)
Journal of Industrial Engineering and Engineering Management
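Funding
National Natural Science Foundation of China (71202168, 71271066)
Fundamental Research Funds for the Central Universities (HIT.NSRIF2010083)
Science and Technology Research Project of the Heilongjiang Provincial Department of Education (12511435)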
Keywords
sentiment analysis
regular collocation feature extraction
mutual information and average mutual information
rough sets
support vector machine