Abstract
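As e-mail has become widespread, the proliferation of spam has drawn growing attention, and how to select e-mail features is an important problem in e-mail classification. After introducing term frequency and inverse document frequency, this paper analyzes and compares several commonly used feature selection algorithms. To address the overly mechanical nature of existing feature selection algorithms, it introduces keyword weights into e-mail classification and proposes an improved TF*IDF feature selection algorithm based on keyword weights, which is then verified experimentally. The experimental results show that a Bayesian filter improved with this algorithm achieves better filtering performance.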
With the development of networks and computers, more and more spam e-mail affects our lives. This paper first introduces the currently popular feature selection methods based on term frequency and inverse document frequency. It then compares and analyzes the various feature extraction algorithms and introduces a new feature extraction algorithm that uses an improved TF*IDF. Finally, it completes experimental verification on the PU1 corpus. The experimental results demonstrate that the improved naive Bayes filter performs better.
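The abstract does not give the paper's weighting formula, so the following is only a minimal sketch of the general idea of keyword-weighted TF*IDF feature selection. It assumes that a keyword's weight simply multiplies its TF*IDF score and that per-document scores are summed over the corpus; the function names `tfidf_scores` and `select_features` and the `keyword_weights` mapping are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_scores(docs, keyword_weights=None):
    """Sum TF * IDF * weight for each term over a tokenized corpus.

    docs: list of token lists, one per e-mail.
    keyword_weights: optional {term: weight}; terms not listed get 1.0.
    The multiplicative weight is an assumption, not the paper's formula.
    """
    keyword_weights = keyword_weights or {}
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        counts = Counter(doc)
        for term, count in counts.items():
            tf = count / len(doc)               # term frequency in this document
            idf = math.log(n_docs / df[term])   # inverse document frequency
            weight = keyword_weights.get(term, 1.0)
            scores[term] += tf * idf * weight
    return scores


def select_features(docs, k, keyword_weights=None):
    """Return the k highest-scoring terms as the selected features."""
    scores = tfidf_scores(docs, keyword_weights)
    return [term for term, _ in scores.most_common(k)]


docs = [
    ["free", "money", "now"],
    ["meeting", "now"],
    ["free", "offer"],
]
plain = select_features(docs, 2)
boosted = select_features(docs, 1, {"free": 3.0})
```

In this toy corpus, "free" appears in two of the three messages, so plain TF*IDF ranks it below the rarer terms; giving it a keyword weight of 3.0 moves it to the top. The selected terms would then serve as the feature set for a classifier such as a naive Bayes filter.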
Source
《计算机应用研究》
CSCD
Peking University Core Journals (北大核心)
2009, Issue 6, pp. 2165-2167 (3 pages)
Application Research of Computers
Funding
Supported by the Natural Science Foundation of Hebei Province (Grant No. F2008000877)