Abstract
In text categorization, the information gain feature selection method combines positively correlated and negatively correlated features in equal proportion, which lowers classification accuracy. To address this problem, a scaling factor is introduced and an adaptive method is proposed. An appropriate scaling factor is added to information gain and, in combination with the classic naive Bayes algorithm, is adjusted automatically so that the improved information gain suits different corpora. Experimental results show that the method selects a suitable scaling factor both for feature spaces of different sizes and for different text collections, and that the improved information gain classifies well on both balanced and imbalanced data sets.
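The abstract does not give the formula, so the following is only a minimal sketch of one way such a scaling factor could work. It assumes that information gain is split into a term-present part (positive correlation) and a term-absent part (negative correlation), weighted by `lam` and `1 - lam`. The names `weighted_ig`, `select_features`, `choose_lambda`, the candidate grid, and the `score` callback are all hypothetical; `score` stands in for, say, the validation accuracy of a naive Bayes classifier trained on the selected features.

```python
import math
from collections import Counter


def weighted_ig(docs, labels, term, lam=0.5):
    """Information gain of `term`, split into a presence and an absence part.

    Assumed form (not taken from the paper): lam weights the term-present
    part and (1 - lam) the term-absent part. With lam = 0.5 the result is
    half the standard information gain, so the feature ranking is unchanged.
    `docs` is a list of sets of terms; `labels` is the parallel class list.
    """
    n = len(docs)
    class_counts = Counter(labels)
    present = Counter(c for d, c in zip(docs, labels) if term in d)
    n_present = sum(present.values())
    absent = {c: class_counts[c] - present.get(c, 0) for c in class_counts}

    def part(joint_counts, n_side):
        # Sum over classes of P(c, side) * log2(P(c, side) / (P(c) * P(side))).
        total = 0.0
        for c, n_c in class_counts.items():
            n_joint = joint_counts.get(c, 0)
            if n_joint == 0:
                continue  # 0 * log 0 is taken as 0
            total += (n_joint / n) * math.log2(n_joint * n / (n_c * n_side))
        return total

    return lam * part(present, n_present) + (1 - lam) * part(absent, n - n_present)


def select_features(docs, labels, vocab, k, lam):
    """Return the k terms of `vocab` with the highest weighted information gain."""
    ranked = sorted(vocab, key=lambda t: weighted_ig(docs, labels, t, lam), reverse=True)
    return ranked[:k]


def choose_lambda(docs, labels, vocab, k, score, candidates=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the candidate lam whose selected features get the highest score.

    `score` is a hypothetical callback mapping a feature list to a number,
    for example the validation accuracy of a naive Bayes classifier.
    """
    return max(candidates, key=lambda lam: score(select_features(docs, labels, vocab, k, lam)))
```

Under these assumptions, adapting to a new corpus or a new feature-space size amounts to rerunning `choose_lambda` with that corpus and that `k`; a finer candidate grid trades extra classifier runs for a more precise factor.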
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
Peking University Core Journals (北大核心)
2014, Issue 8, pp. 2856-2859, 2885 (5 pages in total)
Funding
National High Technology Research and Development Program of China (863 Program) (2011AA01A107)
Keywords
text categorization
information gain
feature selection
scaling factor
adaptive