
Spam Filtering Method Based on Learning from Small Samples (cited by: 2)
Abstract: Client-side spam filters rarely have enough labeled training emails. To address this, the paper proposes a spam filtering method based on learning from small samples, which uses readily available unlabeled emails to improve filtering performance. An initial Naive Bayes (NB) classifier is trained on a small set of labeled emails and used to assign probabilistic labels to the unlabeled emails. A new classifier is then trained on all the data, and the EM algorithm repeats the labeling and retraining steps until convergence. Experimental results show that, given 5 to 20 labeled training emails, the method effectively improves spam filtering performance.
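The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a multinomial Naive Bayes model over tokenized emails with Laplace smoothing, and it assumes convergence is declared when the posteriors of the unlabeled emails stop changing. The function names (`train_nb`, `posterior`, `em_naive_bayes`), the parameters `alpha`, `max_iter`, and `tol`, and the toy emails are all hypothetical.

```python
import math
from collections import Counter

CLASSES = (0, 1)  # 0 = legitimate, 1 = spam


def train_nb(docs, weights, vocab, alpha=1.0):
    """Fit multinomial NB from (possibly fractional) class weights per email."""
    prior = [alpha, alpha]  # smoothed class mass
    counts = [Counter(), Counter()]
    for tokens, w in zip(docs, weights):
        for c in CLASSES:
            prior[c] += w[c]
            for t in tokens:
                counts[c][t] += w[c]
    total = sum(prior)
    log_prior = [math.log(p / total) for p in prior]
    log_like = []
    for c in CLASSES:
        denom = sum(counts[c].values()) + alpha * len(vocab)
        log_like.append({t: math.log((counts[c][t] + alpha) / denom) for t in vocab})
    return log_prior, log_like


def posterior(tokens, model):
    """Return [P(legitimate | email), P(spam | email)]; unknown tokens are ignored."""
    log_prior, log_like = model
    scores = [
        log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
        for c in CLASSES
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def em_naive_bayes(labeled, labels, unlabeled, max_iter=50, tol=1e-6):
    """Train on the labeled emails, then alternate E-steps and M-steps."""
    vocab = {t for doc in labeled + unlabeled for t in doc}
    hard = [[1.0 if y == c else 0.0 for c in CLASSES] for y in labels]
    model = train_nb(labeled, hard, vocab)  # initial classifier, labeled data only
    prev = None
    for _ in range(max_iter):
        # E-step: probabilistically label the unlabeled emails.
        soft = [posterior(doc, model) for doc in unlabeled]
        # M-step: retrain on all emails, weighting each by its class probabilities.
        model = train_nb(labeled + unlabeled, hard + soft, vocab)
        spam_probs = [p[1] for p in soft]
        if prev is not None:
            change = max((abs(a - b) for a, b in zip(spam_probs, prev)), default=0.0)
            if change < tol:  # assumed stopping rule
                break
        prev = spam_probs
    return model


if __name__ == "__main__":
    # Toy data: two labeled emails and four unlabeled ones.
    labeled = [["free", "money", "win"], ["meeting", "report", "project"]]
    labels = [1, 0]
    unlabeled = [
        ["free", "win", "prize"],
        ["money", "prize", "offer"],
        ["project", "meeting", "agenda"],
        ["report", "agenda", "schedule"],
    ]
    model = em_naive_bayes(labeled, labels, unlabeled)
    print(posterior(["prize", "offer"], model))
```

In the toy data, the words "prize" and "offer" never appear in a labeled email, so the initial classifier has no evidence about them. They only acquire a spam association through the unlabeled emails in which they co-occur with known spam words, which is the effect the method relies on.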
Source: Computer Engineering (《计算机工程》), 2010, No. 21, pp. 245-247 (3 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
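Funding: National "973" Program of China (2009CB326203); National Natural Science Foundation of China (60975034); Provincial Natural Science Research Fund of Anhui Higher Education Institutions (KJ2009B238Z).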
Keywords: learning from small samples; EM algorithm; unlabeled data; spam filtering

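References (6)

  • 1 12321 Network Harmful and Spam Information Reporting Center. Survey Report on the State of Anti-Spam in China, First Quarter of 2009[R]. Beijing: Internet Society of China, 2009.
  • 2 Stern H. A Survey of Modern Spam Tools[C]//Proceedings of the 15th Conference on E-mail and Anti-spam. Mountain View, California, USA: [s.n.], 2008.
  • 3 Talbot D. Where SPAM Is Born[J]. Technology Review, 2008, 111(3): 28.
  • 4 Wang Bin, Pan Wenfeng. A Survey of Content-Based Spam Filtering Techniques[J]. Journal of Chinese Information Processing, 2005, 19(5): 1-10. Cited by: 129
  • 5 Zhang Yi, Zhou Jianguo, Yan Puliu. Research and Implementation of a Spam Filtering System[J]. Computer Engineering, 2006, 32(18): 106-108. Cited by: 9
  • 6 Nigam K, McCallum A K, Thrun S, et al. Text Classification from Labeled and Unlabeled Documents Using EM[J]. Machine Learning, 2000, 39(2/3): 103-134.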
