A Feature Selection Method for Text Classification Based on Statistical Frequency (基于统计频率的文本分类特征选择算法研究)
Cited by: 3
Abstract: This paper analyzes the shortcomings of the χ² statistic (Chi-square, CHI) as a feature selection measure: it is unreliable for terms with low document frequency, and it does not indicate the correlation between a term and a category. To address these shortcomings, the paper proposes an improved measure, the Statistical Frequency (SF) algorithm. Comparative experiments show that SF compensates for these deficiencies and achieves good classification performance in text classification.
Source: New Technology of Library and Information Service (《现代图书情报技术》), CSSCI, PKU Core Journal; 2008, No. 11, pp. 44-48 (5 pages)
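Funding: One of the research outputs of the Jiangsu Provincial Department of Education Philosophy and Social Science Fund for Universities project "Performance Evaluation and Development Strategy for Acquired Resources of Digital Libraries in Jiangsu Universities" (Project No. 08SJB8700004)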
Keywords: text categorization; feature selection; KNN; χ² statistic (Chi-square)
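
The abstract names two weaknesses of the CHI measure without giving formulas. As a rough illustration only, the sketch below computes the standard χ² term-category statistic from a 2x2 document-count table and shows both effects on made-up counts. The function names, the counts, and the sign helper are assumptions for illustration; the SF algorithm itself is not defined in this record and is not reproduced here.

```python
def chi_square(a, b, c, d):
    """Standard chi-square score of a term for a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no information.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def correlation_sign(a, b, c, d):
    """Hypothetical helper: direction of the term-category association.

    The squared difference in chi_square discards this sign.
    """
    diff = a * d - c * b
    return (diff > 0) - (diff < 0)


# Illustrative counts, not data from the paper.
positive = (40, 10, 10, 40)   # term concentrated inside the category
negative = (10, 40, 40, 10)   # term concentrated outside the category
rare = (2, 0, 8, 990)         # term seen in only 2 of 1000 documents

# Same score for opposite associations: CHI alone cannot tell them apart.
print(chi_square(*positive), chi_square(*negative))
print(correlation_sign(*positive), correlation_sign(*negative))

# A term supported by only two documents still receives a large score.
print(chi_square(*rare))
```

In this sketch the positively and negatively associated terms get identical scores, and the rare term scores far higher than either even though only two documents support it, which is the kind of behaviour the abstract describes as unreliable.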