
Study on feature selection method in text classification (cited by: 10)
Abstract: This paper analyzes the shortcomings of two traditional feature selection methods in text classification, the chi-square statistic and information gain. It concludes that the key to feature selection is to pick feature words that are concentrated in one class of documents and that occur frequently and are evenly distributed across the documents of that class. Accordingly, the paper jointly considers the document frequency and term frequency of a feature word, together with its inter-class concentration and intra-class dispersion, and proposes a feature selection evaluation function based on intra-class and inter-class document frequency and term frequency statistics. The function is used to select a certain proportion of feature words from each class of the training set to form that class's feature word library; the feature word library of the whole training set is the union of the per-class libraries. SVM-based Chinese text classification experiments show that, compared with the traditional chi-square statistic and information gain, the proposed method improves classification performance to a certain extent.
Authors: Zhao Jing; Shao Xiongkai; Liu Jianzhou; Wang Chunzhi (School of Computer Science, Hubei University of Technology, Wuhan 430068, China)
Source: Application Research of Computers (《计算机应用研究》), CSCD, Peking University Core Journal, 2019, No. 8, pp. 2261-2265 (5 pages)
Funding: National Natural Science Foundation of China, General Program (61772180)
Keywords: text classification; feature selection; dispersion; concentration; frequency
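
The abstract does not give the evaluation function itself, so the following Python sketch only illustrates the overall procedure it describes: score each word per class, keep a proportion of the top-scoring words in each class, and take the union across classes. The score used here (the product of an inter-class concentration term, an intra-class coverage term, and a relative term frequency) is an assumption chosen for illustration and is not the paper's formula; the names `select_features` and `ratio` are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict


def select_features(docs, labels, ratio=0.2):
    """Per-class feature selection sketch.

    docs   : list of token lists (one list per document)
    labels : list of class labels, aligned with docs
    ratio  : proportion of each class's candidate words to keep
    Returns (per_class, vocabulary), where vocabulary is the union
    of the per-class feature word sets.
    """
    n_docs_in_class = Counter(labels)
    df = Counter()               # df[t]: documents containing t (all classes)
    df_c = defaultdict(Counter)  # df_c[c][t]: documents of class c containing t
    tf_c = defaultdict(Counter)  # tf_c[c][t]: occurrences of t in class c

    for tokens, c in zip(docs, labels):
        tf_c[c].update(tokens)
        unique = set(tokens)
        df.update(unique)
        df_c[c].update(unique)

    per_class = {}
    for c in n_docs_in_class:
        total_tokens = sum(tf_c[c].values())
        scores = {}
        for t, d in df_c[c].items():
            concentration = d / df[t]            # share of t's documents that fall in c
            coverage = d / n_docs_in_class[c]    # share of c's documents containing t
            frequency = tf_c[c][t] / total_tokens  # relative term frequency in c
            # Assumed illustrative score, not the paper's evaluation function.
            scores[t] = concentration * coverage * frequency
        k = max(1, math.ceil(ratio * len(scores)))
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        per_class[c] = set(ranked[:k])

    vocabulary = set().union(*per_class.values())
    return per_class, vocabulary


if __name__ == "__main__":
    docs = [
        ["goal", "match", "goal", "the"],
        ["match", "team", "the", "goal"],
        ["stock", "market", "the", "stock"],
        ["market", "bank", "stock", "the"],
    ]
    labels = ["sports", "sports", "finance", "finance"]
    per_class, vocabulary = select_features(docs, labels, ratio=0.5)
    print(sorted(vocabulary))  # ['goal', 'market', 'match', 'stock']
```

In this toy run the word "the" appears in every document of both classes, so its concentration term halves its score and it is not selected for either class, which is the behavior the abstract asks of a good feature selector.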
