Abstract
Machine-learning-based filtering is currently the mainstream approach to spam filtering. The models fall into two broad classes: generative models, represented by Naive Bayes (NB), and discriminative models, represented by logistic regression (LR) and support vector machines (SVM). Previous studies of both classes have each targeted a single language, and little work has examined how far the models are language-independent or language-dependent. This article therefore compares the filtering performance of typical generative and discriminative models on Chinese and English datasets. It evaluates Bogo, a Bayesian filter and a typical generative model, against two typical discriminative models, logistic regression and Relaxed Online SVM (ROSVM). The experiments use the public English datasets TREC05p-1 and TREC06p and the public Chinese datasets TREC06c and SEWM2011, with immediate feedback. The results show that filters based on discriminative models clearly outperform the filter based on the generative model, and that the same model performs better on the Chinese datasets. ROSVM gives the best performance for Chinese spam filtering.
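The comparison described in the abstract can be illustrated with a minimal sketch of online filtering with immediate feedback: each message is classified first, and the true label is then given to the filter for training. The code below is an illustration only, not the authors' Bogo, LR, or ROSVM implementations. The class names, the `run_online` helper, the learning rate, and the toy messages are all assumptions made for this example, and the whitespace tokenizer would not work for Chinese text, which has no spaces between words.

```python
import math

def tokenize(text):
    # Assumption: whitespace tokens; Chinese text would need segmentation or n-grams.
    return text.lower().split()

class OnlineNaiveBayes:
    """Generative sketch: estimates P(token | class) with add-one smoothing."""
    def __init__(self):
        self.token_counts = {0: {}, 1: {}}
        self.total_tokens = {0: 0, 1: 0}
        self.doc_counts = {0: 0, 1: 0}
        self.vocab = set()

    def score(self, tokens):
        # Log-odds of spam (1) versus ham (0); positive means "spam".
        v = len(self.vocab) + 1
        s = math.log((self.doc_counts[1] + 1) / (self.doc_counts[0] + 1))
        for t in tokens:
            p1 = (self.token_counts[1].get(t, 0) + 1) / (self.total_tokens[1] + v)
            p0 = (self.token_counts[0].get(t, 0) + 1) / (self.total_tokens[0] + v)
            s += math.log(p1 / p0)
        return s

    def update(self, tokens, label):
        self.doc_counts[label] += 1
        for t in tokens:
            self.token_counts[label][t] = self.token_counts[label].get(t, 0) + 1
            self.total_tokens[label] += 1
            self.vocab.add(t)

class OnlineLogisticRegression:
    """Discriminative sketch: models P(spam | tokens) directly, trained by SGD."""
    def __init__(self, learning_rate=0.5):
        self.weights = {}
        self.bias = 0.0
        self.learning_rate = learning_rate

    def score(self, tokens):
        # Binary bag-of-words features: each distinct token counts once.
        return self.bias + sum(self.weights.get(t, 0.0) for t in set(tokens))

    def update(self, tokens, label):
        z = max(-30.0, min(30.0, self.score(tokens)))  # clamp to avoid overflow
        error = label - 1.0 / (1.0 + math.exp(-z))
        self.bias += self.learning_rate * error
        for t in set(tokens):
            self.weights[t] = self.weights.get(t, 0.0) + self.learning_rate * error

def run_online(model, stream):
    """Immediate feedback: predict, count mistakes, then train on the true label."""
    mistakes = 0
    for text, label in stream:
        tokens = tokenize(text)
        predicted = 1 if model.score(tokens) > 0 else 0
        mistakes += int(predicted != label)
        model.update(tokens, label)
    return mistakes

# Toy stream (hypothetical data): 1 = spam, 0 = ham.
stream = [
    ("win free money now", 1),
    ("meeting agenda for monday", 0),
    ("free money offer", 1),
    ("monday project meeting", 0),
] * 5

nb, lr = OnlineNaiveBayes(), OnlineLogisticRegression()
nb_mistakes = run_online(nb, stream)
lr_mistakes = run_online(lr, stream)
print("NB mistakes:", nb_mistakes, "LR mistakes:", lr_mistakes)
```

Both filters share the same `score`/`update` interface so that one evaluation loop can drive either model. A toy stream like this says nothing about which model class is better; the paper's conclusions rest on its TREC and SEWM experiments.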
Source
Journal of Heilongjiang Institute of Technology (《黑龙江工程学院学报》)
CAS
2012, No. 2, pp. 65-69 (5 pages)
Funding
Heilongjiang Provincial Department of Education Science and Technology Research Project (General Program), No. 12511444
Keywords
generative model
discriminative model
Chinese spam filtering