A Comparative Study of Typical Machine Learning Algorithms for Spam Filtering

The comparison of spam filter based on generative model and discriminative model
Abstract: Machine-learning-based filtering is currently the mainstream approach to spam filtering. Machine learning models fall into two main classes: generative models, represented by Naive Bayes (NB), and discriminative models, represented by logistic regression (LR) and support vector machines (SVM). Previous studies of the two classes each focused on a single language, and little work has examined whether the models' behavior depends on the language. This article therefore compares the filtering performance of typical generative and discriminative models on Chinese and English datasets. Specifically, it compares Bogo (a system based on the Bayesian algorithm, and hence a typical generative model) with logistic regression and the relaxed online support vector machine (ROSVM), two typical discriminative models. The experiments use the public English datasets TREC05p-1 and TREC06p and the public Chinese datasets TREC06c and SEWM2011, with immediate feedback. The results show that filters based on discriminative models clearly outperform the filter based on the generative model, and that the same model performs better on the Chinese datasets. ROSVM gives the best performance on Chinese spam filtering.
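Source: Journal of Heilongjiang Institute of Technology (《黑龙江工程学院学报》), CAS, 2012, No. 2, pp. 65-69 (5 pages)
Funding: Science and Technology Research (General) Project of the Heilongjiang Provincial Department of Education (12511444)
Keywords: generative model; discriminative model; Chinese spam filtering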
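
The contrast drawn in the abstract between a generative and a discriminative filter, evaluated online with immediate feedback, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the whitespace tokenizer, the class names, the learning rate, and the toy messages are hypothetical, the Naive Bayes class is not the Bogo system, and plain logistic regression stands in for the discriminative side (ROSVM is not implemented).

```python
import math

def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative choice, not the paper's)."""
    return text.lower().split()

class NaiveBayesFilter:
    """Generative sketch: models P(words | class) with Laplace-smoothed counts."""

    def __init__(self):
        self.word_counts = {True: {}, False: {}}
        self.total_words = {True: 0, False: 0}
        self.doc_counts = {True: 0, False: 0}
        self.vocab = set()

    def score(self, tokens):
        """Log-odds of spam versus ham; a value above 0 means 'spam'."""
        v = len(self.vocab) + 1  # +1 leaves room for unseen words
        n_docs = self.doc_counts[True] + self.doc_counts[False]
        log_odds = 0.0
        for label, sign in ((True, 1.0), (False, -1.0)):
            log_p = math.log((self.doc_counts[label] + 1) / (n_docs + 2))
            for t in tokens:
                count = self.word_counts[label].get(t, 0)
                log_p += math.log((count + 1) / (self.total_words[label] + v))
            log_odds += sign * log_p
        return log_odds

    def train(self, tokens, is_spam):
        self.doc_counts[is_spam] += 1
        counts = self.word_counts[is_spam]
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1
            self.total_words[is_spam] += 1
            self.vocab.add(t)

class LogisticRegressionFilter:
    """Discriminative sketch: models P(class | words) directly, trained by SGD."""

    def __init__(self, learning_rate=0.5):  # learning rate is an arbitrary choice
        self.weights = {}
        self.bias = 0.0
        self.learning_rate = learning_rate

    def score(self, tokens):
        """Linear score over binary word features; above 0 means 'spam'."""
        return self.bias + sum(self.weights.get(t, 0.0) for t in set(tokens))

    def train(self, tokens, is_spam):
        z = max(-30.0, min(30.0, self.score(tokens)))  # clamp to avoid overflow
        p_spam = 1.0 / (1.0 + math.exp(-z))
        step = self.learning_rate * ((1.0 if is_spam else 0.0) - p_spam)
        self.bias += step
        for t in set(tokens):
            self.weights[t] = self.weights.get(t, 0.0) + step

def run_online(spam_filter, stream):
    """Immediate feedback: classify each message, then train on its true label."""
    errors = 0
    for text, is_spam in stream:
        tokens = tokenize(text)
        predicted_spam = spam_filter.score(tokens) > 0
        errors += int(predicted_spam != is_spam)
        spam_filter.train(tokens, is_spam)
    return errors

# Toy stream of (message, is_spam) pairs, invented for this sketch.
STREAM = [
    ("cheap pills buy now", True),
    ("meeting agenda for monday", False),
    ("buy cheap watches now", True),
    ("monday lunch with the team", False),
    ("cheap pills cheap watches", True),
    ("agenda for the team meeting", False),
]

if __name__ == "__main__":
    for spam_filter in (NaiveBayesFilter(), LogisticRegressionFilter()):
        print(type(spam_filter).__name__, run_online(spam_filter, STREAM))
```

Both classes expose the same `score` and `train` methods, so `run_online` can evaluate either one on the same message stream. The toy data is far too small to say anything about relative performance; the sketch only shows how the two model families differ in what they estimate.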
