Abstract
The effectiveness of Naive Bayes in spam filtering depends heavily on how well the e-mail content is modelled. However, research on e-mail content modelling is still immature, which limits the performance of Bayesian methods in spam filtering. This paper models e-mail content with three kinds of probability distribution and proposes three Naive Bayes algorithms, one for each distribution. To improve training efficiency, the algorithms use an incremental training method. The three algorithms are compared experimentally on two public datasets, trec05p-1 and trec06p, and the range of conditions under which each distribution is suitable is analysed. By approaching the problem from the perspective of modelling e-mail content with different distributions, the findings provide an effective basis for selecting a spam filtering method.
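The incremental training idea can be illustrated with a minimal sketch. The abstract does not name the three distributions or describe the training procedure in detail, so everything below is an assumption made for illustration: the multinomial event model, Laplace smoothing with a hypothetical `alpha`, whitespace tokenization, and the class name `IncrementalNB` are not taken from the paper.

```python
import math
from collections import Counter


class IncrementalNB:
    """Multinomial Naive Bayes updated one message at a time (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant (assumed)
        self.doc_counts = Counter()              # messages seen per class
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.total_words = Counter()             # tokens seen per class
        self.vocab = set()                       # all tokens seen so far

    def update(self, tokens, label):
        # Incremental step: only the counts change, nothing is retrained from scratch.
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.total_words[label] += len(tokens)
        self.vocab.update(tokens)

    def log_score(self, tokens, label):
        n_docs = sum(self.doc_counts.values())
        # Smoothed log prior over the two classes.
        score = math.log((self.doc_counts[label] + self.alpha) / (n_docs + 2 * self.alpha))
        denom = self.total_words[label] + self.alpha * max(len(self.vocab), 1)
        for t in tokens:
            # Smoothed log likelihood of each token under the class.
            score += math.log((self.word_counts[label][t] + self.alpha) / denom)
        return score

    def predict(self, tokens):
        return max(("spam", "ham"), key=lambda c: self.log_score(tokens, c))


# Toy message stream (made up for this example, not from trec05p-1 or trec06p).
stream = [
    ("win cash prize now".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("cheap prize win win".split(), "spam"),
    ("lunch on monday".split(), "ham"),
]

nb = IncrementalNB()
for tokens, label in stream:
    nb.update(tokens, label)

print(nb.predict("win prize".split()))
print(nb.predict("monday meeting".split()))
```

Because `update` only adjusts counters, each new labelled message costs time proportional to its length, which is the kind of efficiency gain an incremental filter aims for. Swapping in a different distribution would mean changing only the likelihood term in `log_score`.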
Source
《哈尔滨理工大学学报》
CAS
2014, No. 1, pp. 49-53 (5 pages)
Journal of Harbin University of Science and Technology
Funding
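Heilongjiang Province New Century Excellent Talents Training Program for Regular Institutions of Higher Education (1155-ncet-008)
Ministry of Education Humanities and Social Sciences Project (11YJC740048)
Heilongjiang Province Education Science Planning Project (GBC1211062)
Heilongjiang Province Higher Education Teaching Reform Project (2011-NP33)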
Keywords
e-mail filtering
naive Bayes
machine learning