Research on Bayes-Based Spam Filtering (基于贝叶斯算法的垃圾邮件过滤技术)
Cited by: 9
Abstract: Spam has seriously disrupted e-mail communication. This paper studies spam filtering based on the naive Bayes algorithm and validates it experimentally. First, it introduces text categorization techniques, including the commonly used vector space model (VSM) for representing text and feature extraction methods such as information gain and document frequency; the behavior of the information gain method in the experiments is also explained. Second, it derives and analyzes the naive Bayes classification algorithm, which assumes that features are mutually independent. Then, using previously collected mails as the corpus and k-fold cross-validation, it applies the naive Bayes classifier: the class prior probabilities and the class-conditional probabilities of the feature terms are estimated from the training set and used to assign each mail in the test set to a category. Finally, the experimental results are reported using two indicators, precision and recall.
Source: Journal of Nanjing Normal University (Engineering and Technology Edition) (《南京师范大学学报(工程技术版)》), CAS, 2005, No. 4, pp. 61-64 (4 pages)
Funding: Supported by the Natural Science Foundation of Jiangsu Province (01KJD520005)
Keywords: spam; text categorization; vector space model; Bayes algorithm
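
The pipeline summarized in the abstract (estimate class priors and class-conditional term probabilities on a training set, classify held-out mails, score with precision and recall under k-fold cross-validation) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a multinomial naive Bayes model with Laplace smoothing and whitespace tokenization, and it omits the information gain feature selection the paper describes. All function names and the six toy mails are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Estimate log priors and Laplace-smoothed log term probabilities per class."""
    vocab = {t for d in docs for t in d}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_cond = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        # Add-one smoothing (an assumption) keeps unseen terms from zeroing a class.
        log_cond[c] = {
            t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
    return log_prior, log_cond


def classify(tokens, model):
    """Pick the class with the highest log-score; terms outside the vocabulary are skipped."""
    log_prior, log_cond = model
    scores = {
        c: log_prior[c] + sum(log_cond[c][t] for t in tokens if t in log_cond[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


def precision_recall(truth, predicted, positive="spam"):
    """Precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, predicted))
    fp = sum(t != positive and p == positive for t, p in zip(truth, predicted))
    fn = sum(t == positive and p != positive for t, p in zip(truth, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cross_validate(docs, labels, k):
    """k-fold cross-validation; fold i holds out every k-th mail starting at index i."""
    truth, predicted = [], []
    for i in range(k):
        test_idx = set(range(i, len(docs), k))
        train_docs = [d for j, d in enumerate(docs) if j not in test_idx]
        train_labels = [l for j, l in enumerate(labels) if j not in test_idx]
        model = train_nb(train_docs, train_labels)
        for j in sorted(test_idx):
            truth.append(labels[j])
            predicted.append(classify(docs[j], model))
    return precision_recall(truth, predicted)


# Hypothetical toy corpus, for illustration only.
MAILS = [
    ("win free prize now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("free money win big", "spam"),
    ("project report attached", "ham"),
    ("claim your free prize", "spam"),
    ("lunch meeting on friday", "ham"),
]
docs = [text.split() for text, _ in MAILS]
labels = [label for _, label in MAILS]

precision, recall = cross_validate(docs, labels, k=3)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Working in log space avoids numeric underflow when many small probabilities are multiplied. A real filter would train on a far larger corpus and would likely restrict the vocabulary to the most informative terms first.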

References (7)

  • [1] Xu Hongbo (许洪波). Text mining and machine learning (文本挖掘与机器学习). 信息技术快报, 2005, (2): 1-14.
  • [2] Androutsopoulos I, Paliouras G, Michelakis E. Learning to Filter Unsolicited Commercial E-Mail [R]. Technical Report 2004/2, NCSR "Demokritos", 2004.
  • [3] McCallum A K. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering [EB/OL]. http://www.cs.cmu.edu/~mccallum/bow, 1996.
  • [4] Androutsopoulos I, Koutsias J, Chandrinos K V, et al. An evaluation of naive Bayesian anti-spam filtering [C] // Potamias G, Moustakis V, Someren Van M, et al. Proceedings of the Workshop on Machine Learning in the New Information Age. Barcelona: 11th European Conference on Machine Learning (ECML 2000), 2000: 9-17.
  • [5] Sahami M. Using Machine Learning to Improve Information Access [EB/OL]. http://ai.stanford.edu/~sahami/bio.html, 1998.
  • [6] Sahami M, Dumais S, Heckerman D, et al. A Bayesian approach to filtering junk e-mail [C] // Sahami M, Craven M, Joachims T, et al. Learning for Text Categorization: Papers from the 1998 Workshop. [s.l.]: AAAI, 1998.
  • [7] Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers [J]. Machine Learning, 1997, 29: 131-163.
