A Method of Chinese Spam Filtering Based on Suffix Array Clustering (SAC) (一种基于后缀数组聚类(SAC)的中文垃圾邮件过滤方法) Cited by: 1
Abstract The naive Bayes algorithm has been widely applied to spam filtering, but it performs unsatisfactorily on Chinese email. Using the idea of clustering, this paper proposes a token (feature item) extraction method for Chinese email based on suffix array clustering, named SAC. It also compares experimental results of naive Bayes filtering of Chinese spam under different token extraction methods. The experiments show that the method significantly improves filtering performance for Chinese spam.
Source Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2006, No. 5, pp. 107-109, 112 (4 pages)
Keywords naive Bayes, spam filtering, suffix array clustering
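
This record gives only the abstract, so the paper's actual SAC procedure is not reproduced here. As a rough illustration of the general idea the abstract names (using a suffix array to find repeated character strings in unsegmented Chinese text, which could then serve as candidate tokens for a Bayesian filter), a minimal Python sketch might look like the following. The function names, the length bounds, and the sample string are all assumptions made for illustration, and the suffix array is built by naive sorting rather than by an efficient construction algorithm.

```python
def suffix_array(text):
    """Start indices of all suffixes of text, in lexicographic order.

    Naive construction by sorting the suffixes; fine for short messages only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def repeated_substrings(text, min_len=2, max_len=4):
    """Hypothetical helper: all substrings of min_len..max_len characters
    that occur at least twice in text.

    Suffixes sharing a prefix sit next to each other in the suffix array,
    so comparing adjacent entries is enough to find every repeat.
    """
    sa = suffix_array(text)
    found = set()
    for i, j in zip(sa, sa[1:]):
        shared = lcp(text[i:], text[j:])
        for k in range(min_len, min(shared, max_len) + 1):
            found.add(text[i:i + k])
    return found


if __name__ == "__main__":
    # Made-up sample string, not data from the paper.
    sample = "免费发票免费咨询发票"
    print(sorted(repeated_substrings(sample)))
```

In a filter along the lines the abstract describes, strings found this way would presumably replace word-segmentation output as the features fed to naive Bayes; how the paper clusters or selects them is not stated in this record.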

References (11)

  • 1. Sahami M, Dumais S, Heckerman D, et al. A Bayesian Approach to Filtering Junk E-mail. In: AAAI Workshop on Learning for Text Categorization, Madison, Wisconsin, 1998. 55-62
  • 2. Graham P. Better Bayesian Filtering. URL: http://paulgraham.com/better.html, 2003
  • 3. Graham P. A Plan for Spam. URL: http://paulgraham.com/spam.html, 2002
  • 4. Segal R, Crawford J, Kephart J, et al. SpamGuru: An Enterprise Anti-Spam Filtering System. In: Proceedings of the First Conference on Email and Anti-Spam (CEAS), Mountain View, CA, 2004. URL: http://www.ceas.cc/papers-2004/126.pdf
  • 5. Li Guodong, Li Wei. A spam identification system based on text classification techniques [J]. Microelectronics & Computer (微电子学与计算机), 2004, 21(6): 145-146. Cited by: 10
  • 6. Liu Xinbin, Li Jun. A Chinese spam filtering method based on N-gram combination [J]. Microelectronics & Computer (微电子学与计算机), 2004, 21(12): 85-91. Cited by: 5
  • 7. Zamir O, Etzioni O. Grouper: A dynamic clustering interface to web search results. Eighth International World Wide Web Conference, Toronto, 1999
  • 8. Gusfield D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. First edition. New York, USA: Press Syndicate of the University of Cambridge, 1997. 90-91
  • 9. Ukkonen E. On-line construction of suffix trees. Algorithmica, 1995, 14(3): 249-260
  • 10. Manber U, Myers G. Suffix arrays: A new method for on-line string searches. In: Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, 1990. 319-327
