Abstract
A method for filtering spam e-mail is proposed. Lexical features of each message are extracted with tf-idf weighting, and the χ² feature-selection method is used to retain the effective ones. Several structural features with clear discriminative power are also extracted. A support vector machine (SVM) then filters spam automatically. In experiments on the Chinese Academy of Sciences Chinese spam corpus (Cspam), the recognition accuracy, measured by the evaluation value F, exceeds 82%. Combining the tf-idf lexical features with the structural features further improves classification accuracy, which indicates that the method can improve the accuracy of spam filtering.
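The lexical part of the pipeline described in the abstract (tf-idf weighting followed by χ² feature selection) can be illustrated with a minimal sketch. It uses only the Python standard library and assumes the messages are already tokenized; Chinese word segmentation, the structural features, and the SVM classifier are not shown. The function names, the toy messages, and the specific textbook forms of tf-idf (relative term frequency times log inverse document frequency) and of the χ² statistic (2×2 term/class contingency table) are assumptions for illustration and may differ from the formulas used in the paper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (docs are token lists)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def chi_square(docs, labels, term):
    """Chi-square statistic of the 2x2 table: term present/absent vs. spam/ham."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present = term in doc
        if present and label == 1:
            a += 1          # spam containing the term
        elif present:
            b += 1          # ham containing the term
        elif label == 1:
            c += 1          # spam without the term
        else:
            d += 1          # ham without the term
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_terms(docs, labels, k):
    """Keep the k terms with the highest chi-square score."""
    vocab = sorted({term for doc in docs for term in doc})
    return sorted(vocab, key=lambda t: -chi_square(docs, labels, t))[:k]


# Hypothetical toy data: 1 = spam, 0 = legitimate mail.
docs = [
    ["free", "prize", "click"],
    ["free", "offer", "click"],
    ["meeting", "agenda", "notes"],
    ["project", "meeting", "notes"],
]
labels = [1, 1, 0, 0]

weights = tf_idf(docs)
selected = select_terms(docs, labels, 4)
print(selected)
print({t: round(w, 3) for t, w in weights[0].items() if t in selected})
```

In a full system, the tf-idf weights of the selected terms would be concatenated with the structural features and passed to an SVM trainer; that step is left out here because the abstract does not specify the kernel or parameters.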
Source
《天津科技大学学报》
CAS
2010, Issue 2, pp. 72-75 (4 pages)
Journal of Tianjin University of Science & Technology