摘要
针对客户端垃圾邮件过滤器难以获取足够训练样本的问题,提出一种基于小样本学习的垃圾邮件过滤方法,利用容易获取的未标记样本提高垃圾邮件过滤的性能。该方法使用已标记的小样本邮件实例集训练一个初始Na?veBayes分类器,以此标注未标记邮件,再使用所有数据训练新的分类器,利用EM算法进行迭代直至收敛。实验结果证明,当给定5个~20个已标记小样本训练邮件时,该方法可有效提高垃圾邮件过滤性能。
It is difficult to collect sufficient labeled E-mails for training a client spam classifier. Aiming at the problem, this paper proposes a spam filtering method based on learning from small samples, which improves the filtering performance with unlabeled samples. An initial Naive Bayes(NB) classifier is trained with a dataset of labeled E-mails, and unlabeled E-mails are probabilistically labeled with it. A new classifier is trained with all E-mails, and iterates to convergence with EM algorithm. Experimental results prove that, given labeled small training samples with a size of 5 to 20, the performance of spam filtering can be effectively improved.
出处
《计算机工程》
CAS
CSCD
北大核心
2010年第21期245-247,共3页
Computer Engineering
基金
国家“973”计划基金资助项目(2009CB326203)
国家自然科学基金资助项目(60975034)
安徽高等学校省级自然科学研究基金资助项目(KJ2009B238Z)