Abstract
The naive Bayes algorithm is widely used for spam filtering, but it performs poorly on Chinese spam. Drawing on the idea of clustering, this paper proposes a token (feature) extraction method for Chinese email based on suffix array clustering (SAC), and compares the Chinese spam filtering results of the Bayesian algorithm under different token extraction methods. The experiments show that the proposed method significantly improves filtering performance on Chinese spam.
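The abstract does not spell out how SAC works, so the following is only a minimal sketch of the general idea behind suffix-array-based token extraction: repeated substrings of unsegmented Chinese text are collected as candidate tokens without a word segmenter. It is not the paper's method. The function names (`suffix_array`, `lcp_array`, `repeated_substrings`) and the parameters `min_len`, `max_len`, and `min_count` are hypothetical, and the naive suffix sort is chosen for brevity rather than speed.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text itself.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k]; lcp[0] = 0.
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def repeated_substrings(text, min_len=2, max_len=4, min_count=2):
    """Map each substring of length min_len..max_len that occurs at least
    min_count times (overlaps included) to its occurrence count."""
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    found = {}
    for length in range(min_len, max_len + 1):
        k = 1
        while k < len(sa):
            if lcp[k] >= length:
                # Adjacent suffixes sharing a prefix of this length form one group.
                start = k - 1
                while k < len(sa) and lcp[k] >= length:
                    k += 1
                count = k - start
                if count >= min_count:
                    found[text[sa[start]:sa[start] + length]] = count
            else:
                k += 1
    return found


# Hypothetical usage: candidate tokens from a short spam-like string.
candidates = repeated_substrings("免费发票免费发票", min_len=2, max_len=2)
```

In a filter built along these lines, the extracted substrings would replace dictionary-segmented words as the features fed to the naive Bayes classifier. How the paper clusters or selects among such candidates is not stated in the abstract.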
Source
Computer Science (《计算机科学》)
Indexed in: CSCD, Peking University Core Journals (北大核心)
2006, Issue 5, pp. 107-109, 112 (4 pages)
Keywords
Naive Bayes
Spam filtering
Suffix array
Suffix array clustering