Two-step feature selection method for Chinese text categorization (中文文本分类的两步特征选择法)

Cited by: 2
Abstract: Traditional feature selection methods score each feature's importance over the whole corpus, so they may discard features that matter for individual categories. To address this, a two-step feature selection method is proposed. First, features that are only weakly associated with the categories are filtered out. Second, each remaining word is assigned, according to its statistical information, as a discriminating word of a category, and an optimal subset of categorization features is found for each category. Finally, the optimal subsets of all categories are combined to form the final categorization features. In the experiments, naive Bayes serves as the classifier, and the proposed method is compared with the traditional method under five feature selection formulas: IG, ECE, CC, MI, and CHI. The resulting Macro-F1 values (proposed vs. traditional) are 91.075% vs. 86.971% for IG, 91.122% vs. 86.992% for ECE, 91.160% vs. 87.470% for CC, 90.253% vs. 86.061% for MI, and 90.881% vs. 87.006% for CHI. While taking category-specific feature information into account, the method retains as many as possible of the good features chosen by traditional methods, and it captures categorization information better.
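The abstract gives only the outline of the procedure, so the following is a minimal Python sketch of one way the filter, per-category selection, and union steps could fit together, not the authors' implementation. Several details are assumptions made for illustration: CHI is used as the scoring formula (it is one of the five the abstract names), a word's home category is taken to be the class where its relative document frequency is highest, and "optimal subset" is approximated by the top `per_class_k` words per class. The names `two_step_select`, `min_score`, and `per_class_k` are hypothetical.

```python
from collections import Counter, defaultdict


def chi_square(n11, n10, n01, n00):
    """CHI statistic of a 2x2 term/class contingency table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def two_step_select(docs, labels, min_score=0.0, per_class_k=2):
    """docs: list of token lists; labels: parallel list of class names."""
    classes = sorted(set(labels))
    n_docs = len(docs)
    class_size = Counter(labels)
    df = Counter()                      # document frequency of each term
    df_in_class = defaultdict(Counter)  # document frequency per class
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_in_class[label][term] += 1

    # Score every (term, class) pair with CHI.
    scores = {}
    for term in df:
        for c in classes:
            n11 = df_in_class[c][term]        # in class, contains term
            n10 = df[term] - n11              # other classes, contains term
            n01 = class_size[c] - n11         # in class, lacks term
            n00 = n_docs - n11 - n10 - n01    # other classes, lacks term
            scores[(term, c)] = chi_square(n11, n10, n01, n00)

    # Step 1: drop terms with no strong association to any class.
    kept = [t for t in df if max(scores[(t, c)] for c in classes) > min_score]

    # Step 2: assign each kept term to a home class (assumption: the class
    # with the highest relative document frequency), then keep the
    # per_class_k best-scoring terms of each class (assumed stand-in for
    # the paper's "optimal subset").
    by_class = defaultdict(list)
    for term in kept:
        home = max(classes, key=lambda c: df_in_class[c][term] / class_size[c])
        by_class[home].append(term)

    selected = set()
    for c in classes:
        ranked = sorted(by_class[c], key=lambda t: (-scores[(t, c)], t))
        selected.update(ranked[:per_class_k])  # union of per-class subsets
    return selected


# Toy corpus (invented for illustration only).
docs = [
    ["football", "match", "team"],
    ["football", "goal", "team"],
    ["stock", "market", "team"],
    ["stock", "bank", "market"],
]
labels = ["sports", "sports", "finance", "finance"]
print(sorted(two_step_select(docs, labels, min_score=2.0, per_class_k=2)))
# prints ['football', 'market', 'stock']
```

Selecting per class rather than globally is the point of the second step: a word that ranks low over the whole corpus can still be kept if it is among the best discriminators of its own category.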
Source: Computer Aided Engineering (《计算机辅助工程》), 2008, No. 3, pp. 76-80 (5 pages)
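Funding: National Natural Science Foundation of China (60703010); Natural Science Foundation of Chongqing (2006BB2374); Science and Technology Research Project of the Chongqing Municipal Education Commission (KJ070519); Scientific Research Foundation for Returned Overseas Chinese Scholars, Ministry of Education (教外司留[2007]1109号)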
Keywords: two-step feature selection; Chinese text categorization; category discriminating word; naive Bayesian