Abstract
This paper proposes an improved feature selection method based on information gain. First, feature selection is performed on the data set class by class, which reduces the influence of data set imbalance on the features selected. Second, information gain weights are computed from the probability of feature occurrence, which reduces the interference of low-frequency words. Finally, dispersion is used to analyze each feature's information gain value in every class, filtering out the relatively redundant features among the high-frequency words; the selected features are then further refined using information gain differences to obtain a uniform and accurate feature subset. A comparison of evaluation function values across different algorithms shows that the feature subset selected by the proposed method has better classification ability.
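For orientation, the standard (unimproved) information gain score that the method builds on can be sketched as follows. This is a minimal illustration of the textbook definition, IG(t) = H(C) - P(t) H(C|t) - P(¬t) H(C|¬t), and not the paper's improved method: the per-class selection, occurrence-probability weighting, dispersion filter, and information gain difference steps are not reproduced, because the abstract does not give their formulas. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Textbook information gain of one term with respect to the class labels.

    docs   : list of sets of tokens (one set per document)
    labels : list of class labels, aligned with docs
    term   : the candidate feature (a token)
    """
    n = len(docs)
    if n == 0:
        return 0.0
    # Class counts among documents that contain / do not contain the term.
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n_with = sum(with_term.values())
    n_without = n - n_with
    # H(C): entropy of the class distribution over all documents.
    h_c = entropy(list(Counter(labels).values()))
    # H(C | term presence): entropy weighted by P(t) and P(not t).
    h_cond = (n_with / n) * entropy(list(with_term.values())) \
        + (n_without / n) * entropy(list(without_term.values()))
    return h_c - h_cond


# Hypothetical toy corpus: two classes, four documents.
toy_docs = [{"goal", "match"}, {"goal", "team"}, {"stock", "market"}, {"stock", "team"}]
toy_labels = ["sport", "sport", "finance", "finance"]

# Rank the vocabulary by information gain, highest first.
vocabulary = sorted(set().union(*toy_docs))
ranking = sorted(vocabulary,
                 key=lambda t: information_gain(toy_docs, toy_labels, t),
                 reverse=True)
```

In this toy corpus a term such as "goal", which appears only in one class, receives the maximum score, while "team", which is spread evenly across both classes, scores zero. The abstract's refinements address the cases where this plain score is distorted by class imbalance and by low-frequency words.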
Source
Journal of Shandong Agricultural University: Natural Science Edition (《山东农业大学学报(自然科学版)》)
CSCD
PKU Core (北大核心)
2013, Issue 2, pp. 252-256 (5 pages)
Keywords
Feature selection
Text classification
Information gain