摘要
针对文本分类中传统特征选择方法卡方统计量和信息增益的不足进行了分析,得出文本分类中的特征选择关键在于选择出集中分布于某类文档并在该类文档中均匀分布且频繁出现的特征词。因此,综合考虑特征词的文档频、词频以及特征词的类间集中度、类内分散度,提出一种基于类内类间文档频和词频统计的特征选择评估函数,并利用该特征选择评估函数在训练集每个类别中选取一定比例的特征词组成该类别的特征词库,而训练集的特征词库则为各类别特征词库的并集。通过基于SVM的中文文本分类实验表明,该方法与传统的卡方统计量和信息增益相比,在一定程度上提高了文本分类的效果。
The traditional feature selection method of chi-square test and information gain in text classification has its inherent defect. This paper analyzed the key of feature selection in text classification being to select feature words distributed evenly and frequently in each type of documents. This should consider not only the document frequency and term frequency of feature words, but also the inter class concentration degree and the intra class scatter degree of feature words. It proposed a feature selection evaluation function that is based on document frequency of within-class and between-class and term frequency statistics. The feature selection evaluation function could select a certain proportion of the feature words in each category of the training set to form the corresponding class of the feature word library. The entire feature word library of the training set could be composed by each of such classes as a result. It carried out the experiment of Chinese text classification based on SVM. The experimental results show that the proposed method improves the effectiveness of text classification to a certain extent, compared with the traditional chi-square test and information gain.
作者
赵婧
邵雄凯
刘建舟
王春枝
Zhao Jing;Shao Xiongkai;Liu Jianzhou;Wang Chunzhi(School of Computer Science,Hubei University of Technology,Wuhan 430068,China)
出处
《计算机应用研究》
CSCD
北大核心
2019年第8期2261-2265,共5页
Application Research of Computers
基金
国家自然科学基金面上资助项目(61772180)
关键词
文本分类
特征选择
分散度
集中度
频度
text classification
feature selection
distribution
concentration
frequency