A Comparative Study of Typical Machine Learning Algorithms for Spam Filtering

The comparison of spam filter based on generative model and discriminative model


Abstract: Machine-learning-based filtering is currently the mainstream approach to spam filtering. Machine learning models fall into two main categories: generative models, represented by Naive Bayes (NB), and discriminative models, represented by logistic regression (LR) and support vector machines (SVM). Previous studies of the two kinds of model each targeted a single language, and little work has examined how far the models' behavior is independent of, or dependent on, the language. This paper therefore compares the filtering performance of typical generative and discriminative models on both Chinese and English datasets. Specifically, it compares Bogo (a filter based on the Bayesian algorithm, and thus a typical generative model) with logistic regression and the relaxed online support vector machine (ROSVM), two typical discriminative models. The experiments use the public English datasets TREC05p-1 and TREC06p and the public Chinese datasets TREC06c and SEWM2011, evaluated with immediate feedback. The results show that spam filters based on discriminative models clearly outperform the filter based on the generative model, and that the same model achieves better results on the Chinese datasets. ROSVM gives the best performance on Chinese spam filtering.
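The comparison described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it does not use Bogo, ROSVM, or the TREC/SEWM data. It assumes character 4-gram features (chosen here because they need no word segmentation and so apply to both Chinese and English), a simplified Naive Bayes that scores only the n-grams present in a message, an online logistic regression with an assumed learning rate of 0.1, and a handful of invented toy messages. It shows the immediate-feedback protocol: each message is classified first, and only then is its true label revealed to the filter.

```python
import math


def char_ngrams(text, n=4, limit=3000):
    """Set of overlapping character n-grams (assumed feature scheme)."""
    text = text[:limit]
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


class NaiveBayesFilter:
    """Generative model: Naive Bayes over present n-grams, Laplace smoothed."""

    def __init__(self):
        self.counts = {0: {}, 1: {}}   # n-gram document counts per class
        self.docs = {0: 0, 1: 0}       # messages seen per class

    def score(self, feats):
        # Log-odds of spam (1) versus ham (0); positive means spam.
        s = math.log((self.docs[1] + 1) / (self.docs[0] + 1))
        for f in feats:
            p1 = (self.counts[1].get(f, 0) + 1) / (self.docs[1] + 2)
            p0 = (self.counts[0].get(f, 0) + 1) / (self.docs[0] + 2)
            s += math.log(p1 / p0)
        return s

    def train(self, feats, label):
        self.docs[label] += 1
        for f in feats:
            self.counts[label][f] = self.counts[label].get(f, 0) + 1


class LogisticRegressionFilter:
    """Discriminative model: online logistic regression (gradient step per message)."""

    def __init__(self, rate=0.1):      # learning rate is an assumption
        self.w = {}
        self.rate = rate

    def score(self, feats):
        return sum(self.w.get(f, 0.0) for f in feats)

    def train(self, feats, label):
        z = max(-30.0, min(30.0, self.score(feats)))  # clamp to avoid overflow
        p = 1.0 / (1.0 + math.exp(-z))
        for f in feats:
            self.w[f] = self.w.get(f, 0.0) + self.rate * (label - p)


def run_immediate_feedback(model, stream):
    """Classify each message, then train on its true label; return error count."""
    errors = 0
    for text, label in stream:
        feats = char_ngrams(text)
        predicted = 1 if model.score(feats) > 0 else 0
        errors += predicted != label
        model.train(feats, label)
    return errors


# Invented toy stream (1 = spam, 0 = ham), mixing English and Chinese messages.
stream = [
    ("win free money now", 1),
    ("meeting at noon tomorrow", 0),
    ("免费发票代开", 1),
    ("明天下午开会", 0),
] * 5

models = {
    "naive_bayes": NaiveBayesFilter(),
    "logistic_regression": LogisticRegressionFilter(),
}
results = {name: run_immediate_feedback(m, stream) for name, m in models.items()}
for name, errors in results.items():
    print(f"{name}: {errors} errors on {len(stream)} messages")
```

On real corpora the two filters would be compared with the measures used in the TREC spam track rather than a raw error count; the toy stream here only exercises the protocol.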

Authors: Ding Huafu (丁华福), Wang Yingying (王莹莹), Han Yong (韩咏), Min Li (闵莉), Zou Yu (邹钰)

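Affiliations: School of Computer Science and Technology, Harbin University of Science and Technology; School of Computer Science and Technology, Heilongjiang Institute of Technology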

Source: Journal of Heilongjiang Institute of Technology (《黑龙江工程学院学报》), CAS, 2012, No. 2, pp. 65-69 (5 pages)

Funding: Science and Technology Research (General) Project of the Heilongjiang Provincial Department of Education (12511444)

Keywords: generative model; discriminative model; Chinese spam filtering

Classification: TP393 [Automation and Computer Technology: Computer Application Technology]




