
Research on spam detection techniques based on clustering

Cited by: 2
Abstract: As the volume of spam email keeps rising, detecting it effectively has become an important and urgent problem. To overcome the shortcomings of the k-nearest neighbor (kNN) classifier in spam detection, this paper proposes an improved kNN detection method based on clustering. First, a single-pass clustering algorithm based on the minimum-distance principle partitions the training emails into hyperspheres of approximately equal radius, each containing texts from one or more categories. Second, each cluster (hypersphere) is labeled by majority vote, that is, with the category that accounts for the most texts in the cluster; the resulting detection model consists of these labeled clusters. Finally, incoming emails are classified automatically by applying the nearest-neighbor idea to this model. Experimental results show that the method substantially reduces the number of email similarity computations and performs better than TiMBL, Naive Bayesian, and Stacking. The detection model can also be updated incrementally, which gives the method practical value.
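The three steps in the abstract (single-pass clustering, majority-vote labeling, nearest-cluster classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine similarity measure, the `threshold` parameter standing in for the hypersphere radius, the summed-vector centroid, the 1-nearest-cluster decision rule, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts (assumed measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass_cluster(samples, threshold, clusters=None):
    """One pass over (vector, label) pairs.

    Each sample joins its most similar cluster if the similarity reaches
    `threshold`; otherwise it starts a new cluster. Passing an existing
    `clusters` list updates the model incrementally.
    """
    clusters = [] if clusters is None else clusters
    for vec, label in samples:
        best, best_sim = None, -1.0
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            # Absorb the sample: the summed vector serves as the centroid,
            # since cosine similarity ignores vector length.
            for t, w in vec.items():
                best["centroid"][t] = best["centroid"].get(t, 0.0) + w
            best["labels"][label] += 1
        else:
            clusters.append({"centroid": dict(vec), "labels": Counter([label])})
    return clusters


def tag_clusters(clusters):
    """Label each cluster with its majority category."""
    for c in clusters:
        c["tag"] = c["labels"].most_common(1)[0][0]
    return clusters


def classify(vec, clusters):
    """Assign the tag of the most similar cluster (1-nearest-cluster rule)."""
    return max(clusters, key=lambda c: cosine(vec, c["centroid"]))["tag"]


# Toy training set (hypothetical data, term counts as weights).
train = [
    ({"free": 1, "money": 1}, "spam"),
    ({"free": 1, "offer": 1}, "spam"),
    ({"meeting": 1, "agenda": 1}, "ham"),
    ({"meeting": 1, "notes": 1}, "ham"),
]
model = tag_clusters(single_pass_cluster(train, threshold=0.4))
print(len(model), classify({"free": 1, "prize": 1}, model))
```

Under these assumptions, classifying a message costs one similarity computation per cluster rather than one per training email, which is the saving the abstract refers to. Incremental updating corresponds to calling `single_pass_cluster` again with the existing cluster list and then re-running `tag_clusters`.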
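Source: Journal of Shandong University (Natural Science), 2011, No. 5, pp. 71-76 (6 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (61070061); Natural Science Foundation of Guangdong Province (9151026005000002); Guangdong Province High-Level Talent Project; Guangdong University of Foreign Studies Graduate Innovation Team Project (10GWCXTD-08).
Keywords: spam detection; kNN text categorization; single-pass clustering; incremental modeling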



