Abstract
This paper briefly summarizes spam filtering based on the Support Vector Machine (SVM). Mail vectors are constructed with two models, TF-IDF and Bernoulli, and the effect of CHI-based dimensionality reduction on mail classification with a linear SVM is first tested in detail. Kernel-based SVMs are then introduced into spam filtering: SVMs with linear, polynomial, and radial basis function (RBF) kernels are compared in terms of classification accuracy and training time. The paper also argues, and supports with analysis, that an imbalance in the training samples strongly affects classification accuracy and the false positive rate, and it examines the experimental results from a theoretical standpoint. The experiments show that an SVM classifier with the RBF kernel filters spam effectively.
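The abstract names CHI feature selection and the TF-IDF and Bernoulli vector models but does not give their formulas. The following is a minimal sketch of those preprocessing steps, assuming the standard chi-square statistic over a 2x2 term/class contingency table and a plain log(N/df) inverse document frequency. All function names and the toy data are hypothetical, and the paper's exact weighting variants may differ. SVM training itself is left out, since it would normally rely on an external library.

```python
import math
from collections import Counter

def chi_square(term, docs, labels, positive="spam"):
    """Chi-square score of `term` from the 2x2 table (term present/absent) x (spam/other)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        is_pos = label == positive
        if present and is_pos:
            a += 1          # term present, spam
        elif present:
            b += 1          # term present, not spam
        elif is_pos:
            c += 1          # term absent, spam
        else:
            d += 1          # term absent, not spam
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score (ties broken alphabetically)."""
    vocab = sorted({t for tokens in docs for t in tokens})
    ranked = sorted(vocab, key=lambda t: (-chi_square(t, docs, labels), t))
    return ranked[:k]

def bernoulli_vector(tokens, features):
    """Bernoulli model: 1 if the feature term occurs in the mail, else 0."""
    present = set(tokens)
    return [1 if f in present else 0 for f in features]

def tfidf_vector(tokens, features, docs):
    """TF-IDF model: raw term count times log(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(docs)
    vec = []
    for f in features:
        df = sum(1 for doc in docs if f in doc)
        idf = math.log(n_docs / df) if df else 0.0
        vec.append(counts[f] * idf)
    return vec

# Toy corpus (hypothetical): two spam mails and two legitimate mails.
docs = [["free", "money", "now"], ["free", "offer"],
        ["meeting", "tomorrow"], ["project", "meeting", "notes"]]
labels = ["spam", "spam", "ham", "ham"]

features = select_features(docs, labels, k=2)
print(features)
print(bernoulli_vector(docs[0], features))
print(tfidf_vector(docs[0], features, docs))
```

Vectors built this way could then be passed to an SVM implementation to compare linear, polynomial, and RBF kernels on accuracy and training time, which is the comparison the abstract describes.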
Source
Journal of Computer Applications (《计算机应用》)
CSCD
PKU Core Journal (北大核心)
2008, Issue 2, pp. 424-427 (4 pages)
Funding
National 863 Program project (2002AA415270)
Keywords
Support Vector Machine (SVM)
spam filtering
kernel function
feature selection