Abstract
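In feature extraction for sentiment analysis of Chinese short text, the term frequency-inverse document frequency (TF-IDF) algorithm computes the distribution of feature words in a one-sided way, and the information gain (IG) algorithm does not extract short-text features well. To address this, an improved feature selection algorithm, ITFIDF-IG, is proposed. Based on the characteristics of short-text corpora, it raises the weights of feature words that are more useful for classification, reduces interference from irrelevant words, and takes into account the classification effect reflected in how feature words are distributed. It thereby extracts feature words that contribute more to classification, is better suited to sentiment analysis of Chinese short text, and achieves good classification performance.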
The term frequency-inverse document frequency (TF-IDF) method captures the distribution of feature words only partially, the accuracy of the information gain (IG) algorithm declines when features are sparse, and computation is further skewed by the imbalanced distribution of the text corpus. To address these shortcomings, a sentiment analysis algorithm, ITFIDF-IG, based on an improved feature selection algorithm is proposed; it raises the weights of features according to their contribution to classification. Applied to sentiment analysis of Chinese short text, the method increases the contribution of the selected features to classification and reduces the interference caused by differing numbers of texts among sets. It is better suited to Chinese short text sentiment analysis and achieves better classification performance.
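For orientation, the two baseline measures named in the abstract can be sketched as follows. This is a minimal illustration of textbook TF-IDF and information gain only; it is not the ITFIDF-IG method, whose formula is not given in this record. The function names, the toy corpus, and the assumption that texts are already segmented into words are all hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Textbook TF-IDF: tf(t, d) * ln(N / df(t)) for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) * H(C | t) - P(not t) * H(C | not t)."""
    with_term = [y for doc, y in zip(docs, labels) if term in doc]
    without_term = [y for doc, y in zip(docs, labels) if term not in doc]
    conditional = 0.0
    for subset in (with_term, without_term):
        if subset:  # skip an empty split to avoid dividing by zero
            conditional += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - conditional


# Hypothetical toy corpus: four pre-segmented short reviews with sentiment labels.
docs = [
    ["服务", "很", "好"],
    ["质量", "很", "差"],
    ["很", "好", "推荐"],
    ["太", "差", "失望"],
]
labels = ["pos", "neg", "pos", "neg"]

weights = tf_idf(docs)
ig_good = information_gain(docs, labels, "好")
ig_very = information_gain(docs, labels, "很")
```

In this toy corpus, "好" separates the two classes perfectly while "很" occurs in both, so information gain ranks "好" higher. TF-IDF uses no class labels at all, which is one reason the two measures are often combined.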
Authors
王荣波
沈卓奇
黄孝喜
谌志群
WANG Rongbo; SHEN Zhuoqi; HUANG Xiaoxi; CHEN Zhiqun (Institute of Cognitive and Intelligent Computing, Hangzhou Dianzi University, Hangzhou, Zhejiang 310018, China)
Source
《杭州电子科技大学学报(自然科学版)》
2019, No. 1, pp. 45-50 (6 pages)
Journal of Hangzhou Dianzi University:Natural Sciences
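Funding
Ministry of Education Humanities and Social Sciences Planning Youth Fund Project (12YJCZH201)
Ministry of Education Humanities and Social Sciences Research Planning Fund Project (18YJA740016)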
Keywords
feature selection
sentiment analysis
term frequency-inverse document frequency
information gain
Chinese short text