Abstract
The information gain method assigns text feature weights from the perspective of the whole training set, which makes it unsuitable for constructing category feature vectors. We propose an improved category-based text feature weighting model. First, an improved Naive Bayes method selects the category features used to construct the category vector. Second, word frequency information is used to improve the information gain model for text feature selection, which mitigates that model's insufficient use of information from medium-frequency words. Subsequent text categorization experiments show that the proposed weighting model achieves better categorization results than the conventional information gain method.
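For orientation, the conventional information gain score that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration of the textbook definition (class entropy minus the entropy conditioned on a term's presence), computed over the whole training set. It is not the improved model proposed in the paper, and the function names and toy data are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Textbook information gain of `term` over the whole training set.

    docs   -- list of token collections, one per document (hypothetical input format)
    labels -- class label of each document, parallel to `docs`
    """
    n = len(docs)
    if n == 0:
        return 0.0
    # Class distributions for all documents, and for those with / without the term.
    all_counts = Counter(labels)
    with_term = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    without_term = Counter(lab for doc, lab in zip(docs, labels) if term not in doc)
    n_with = sum(with_term.values())
    n_without = n - n_with
    conditional = (
        (n_with / n) * entropy(with_term.values())
        + (n_without / n) * entropy(without_term.values())
    )
    return entropy(all_counts.values()) - conditional


# Toy corpus (invented for illustration only).
docs = [
    {"goal", "match"},
    {"goal", "team"},
    {"stock", "market"},
    {"stock", "team"},
]
labels = ["sports", "sports", "finance", "finance"]

ig_goal = information_gain(docs, labels, "goal")  # term that separates the classes
ig_team = information_gain(docs, labels, "team")  # term spread evenly across classes
```

Because the score depends only on the presence or absence of a term across all documents, it produces a single global weight per term rather than one weight per category, and it ignores how often the term occurs within a document. These are the two limitations the abstract points to.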
Source
《计算机应用与软件》
CSCD
2010, Issue 6, pp. 8-10, 56 (4 pages in total)
Computer Applications and Software
Funding
Supported by the National Natural Science Foundation of China (70571087)
Keywords
Text categorization
Feature selection
Naive Bayes
Feature weighting