Research on Spam Detection Techniques Based on Clustering

Cited by: 2


Abstract: As the volume of spam email keeps rising, detecting it effectively has become an important and urgent problem. To overcome the shortcomings of the k-nearest neighbor (kNN) classifier in spam detection, this paper proposes an improved kNN detection method based on clustering. First, a one-pass clustering algorithm based on the minimum-distance principle partitions the training email set into hyperspheres of nearly the same size, each containing texts from one or more categories. Second, the resulting clusters are labeled by majority voting: each cluster takes the category that accounts for the most texts in it, and the detection model consists of these labeled clusters. Finally, incoming emails are identified automatically using the nearest-neighbor classification idea. Experimental results show that the method substantially reduces the number of email similarity computations and performs better than TiMBL, Naive Bayesian, and Stacking. The detection model can also be updated incrementally, which makes the method practical for real-world use.
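The three steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the similarity measure, the distance threshold, or how a cluster is represented, so the cosine distance over sparse term-weight dictionaries, the `threshold` parameter, the summed-vector cluster representative, and all class and function names here are assumptions. Classification is simplified to taking the label of the single nearest cluster.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class Cluster:
    """One hypersphere: summed member vectors plus per-class counts."""

    def __init__(self):
        self.total = Counter()  # summed term weights (same direction as the centroid)
        self.votes = Counter()  # class label -> number of member emails

    def add(self, vec, label):
        self.total.update(vec)
        self.votes[label] += 1

    @property
    def label(self):
        # Majority vote: the class with the most member emails.
        return self.votes.most_common(1)[0][0]


class OnePassModel:
    """Labeled clusters built in a single pass over the training emails."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed maximum distance for joining a cluster
        self.clusters = []

    def _nearest(self, vec):
        best, best_dist = None, float("inf")
        for cluster in self.clusters:
            dist = 1.0 - cosine(vec, cluster.total)
            if dist < best_dist:
                best, best_dist = cluster, dist
        return best, best_dist

    def learn(self, vec, label):
        # Join the nearest cluster if it is close enough, else open a new one.
        # Calling learn() again later updates the model incrementally.
        cluster, dist = self._nearest(vec)
        if cluster is None or dist > self.threshold:
            cluster = Cluster()
            self.clusters.append(cluster)
        cluster.add(vec, label)

    def predict(self, vec):
        cluster, _ = self._nearest(vec)
        if cluster is None:
            raise ValueError("model has no clusters yet")
        return cluster.label


# Toy usage with made-up term vectors.
model = OnePassModel(threshold=0.6)
for vec, label in [
    ({"free": 1, "win": 1}, "spam"),
    ({"free": 1, "prize": 1}, "spam"),
    ({"meeting": 1, "agenda": 1}, "ham"),
    ({"meeting": 1, "notes": 1}, "ham"),
]:
    model.learn(vec, label)
print(len(model.clusters), model.predict({"free": 1, "offer": 1}))
```

In this sketch a new email is compared against one representative per cluster rather than against every training email, which is where the reduction in similarity computations described in the abstract would come from. Incremental updating amounts to calling `learn` on newly labeled emails without rebuilding the existing clusters.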

Authors: JIANG Shengyi (蒋盛益), PANG Guansong (庞观松), ZHANG Jianjun (张建军)

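Affiliations: School of Informatics, Guangdong University of Foreign Studies; School of International Business Administration, Guangdong University of Foreign Studies; College of Science, Naval University of Engineering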

Source: Journal of Shandong University (Natural Science), 2011, No. 5, pp. 71-76 (6 pages). Indexed in: CAS, CSCD, Peking University Core Journals.

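Funding: National Natural Science Foundation of China (61070061); Natural Science Foundation of Guangdong Province (9151026005000002); Guangdong Province High-Level Talent Project; Guangdong University of Foreign Studies Graduate Innovation Team Project (10GWCXTD-08)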

Keywords: spam detection; kNN text categorization; single pass clustering; incremental modeling

Classification number: TP311 [Automation and Computer Technology: Computer Software and Theory]


