Abstract
This paper compares four feature selection methods: document frequency (DF), which does not use class information, and information gain (IG), mutual information (MI), and the chi-square (χ²) statistic (CHI), which do. Building on this comparison, it analyzes the drawbacks of earlier approaches that combine the two kinds of methods directly, and proposes a joint feature selection algorithm based on relevance and redundancy. The algorithm pairs DF with each of IG, MI, and CHI in turn, aiming to remove redundant features while retaining those useful for classification, and thereby to improve text sentiment classification. Experimental results show that the joint method performs well and effectively reduces the feature dimension.
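For readers unfamiliar with the four measures named in the abstract, the following is a minimal sketch of how they are commonly computed for one term and one class from document counts. It uses the standard textbook definitions for a two-class problem (for example, positive versus negative sentiment); the abstract does not give the paper's exact formulas or its joint algorithm, so the function names `entropy` and `term_scores` and the count names `a`, `b`, `c`, `d` are assumptions made for illustration only.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def term_scores(a, b, c, d):
    """Score one term against one class in a two-class corpus.

    Hypothetical count names (not taken from the paper):
      a: documents in the class that contain the term
      b: documents outside the class that contain the term
      c: documents in the class that lack the term
      d: documents outside the class that lack the term
    """
    n = a + b + c + d

    # Document frequency ignores the class labels entirely.
    df = a + b

    # Chi-square statistic of the 2x2 term/class contingency table.
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    chi = n * (a * d - c * b) ** 2 / denom if denom else 0.0

    # Pointwise mutual information between term presence and the class.
    mi = math.log2(a * n / ((a + c) * (a + b))) if a > 0 else float("-inf")

    # Information gain: class entropy minus the entropy that remains
    # after splitting the documents on presence or absence of the term.
    ig = (
        entropy([a + c, b + d])
        - ((a + b) / n) * entropy([a, b])
        - ((c + d) / n) * entropy([c, d])
    )

    return {"df": df, "ig": ig, "mi": mi, "chi": chi}


if __name__ == "__main__":
    # Made-up counts for a term that appears more often inside the class.
    print(term_scores(a=30, b=10, c=20, d=40))
```

DF depends only on `a + b`, while the other three scores depend on how the term is distributed across the classes, which is the distinction the abstract draws between the two kinds of methods.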
Source
Computer Science (《计算机科学》)
CSCD
PKU Core Journals (北大核心)
2012, No. 4, pp. 181-184 (4 pages)
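Funding
National Natural Science Foundation of China (60903225)
Innovation Fund for Outstanding Graduate Students, National University of Defense Technology (S100502)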
Keywords
Text sentiment classification
Joint feature selection
Relevance
Redundant feature