
Application of the Statistical Frequency Algorithm in Text Information Filtering Systems (cited by: 4)

A Feature Selection Method for Text Information Filtering Based on Statistical Frequency
Abstract: Feature selection over documents is one of the central problems in text information filtering. This paper analyzes the shortcomings of the χ² statistic (chi-square, CHI): it is unreliable for terms with low document frequency, and it does not indicate how a term and a category are related. To address these weaknesses, the paper proposes an improved Statistical Frequency (SF) algorithm and applies it to a text information filtering system. Comparative experiments show that the SF algorithm compensates for these shortcomings and achieves good filtering performance.
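As background for the two criticisms in the abstract, the sketch below computes the standard χ² score of a term for a category from a 2×2 document-count table. It is the textbook CHI formula, not the SF algorithm, whose definition is not given in this record. The function name `chi_square` and the counts used in the examples are illustrative assumptions.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Standard chi-square score of a term for a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): treat the score as 0.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


# Hypothetical counts: a term concentrated inside the category ...
positive = chi_square(40, 10, 10, 40)
# ... and a term concentrated outside it receive the same score,
# because the difference (a*d - c*b) is squared.
negative = chi_square(10, 40, 40, 10)

# Hypothetical counts: a term that occurs in a single document overall
# still receives a large score.
rare = chi_square(1, 0, 9, 990)
```

With these made-up counts, `positive` and `negative` are equal, so the score alone cannot tell whether a term signals membership in the category or absence from it, and `rare` exceeds both even though it rests on one document. Both effects match the weaknesses of CHI that the abstract names.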
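Authors: Zhang Fan (张帆), Zhang Junli (张俊丽)
Source: Library and Information Service (《图书情报工作》), indexed in CSSCI and the Peking University Core Journals list, 2009, No. 13, pp. 116-119 (4 pages)
Funding: One of the research outputs of the 2006 National Social Science Fund of China project "Research on Network Information Filtering" (project No. 06BTQ024)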
Keywords: text filtering; feature selection; χ² statistic (English keywords as published: text categorization; feature selection; chi-square)











