
Spam Filter Based on Term Co-Occurrence Model (基于词共现模型的垃圾邮件过滤方法研究)

Cited by: 4
Abstract: The aim of spam filtering is to decide whether an email is spam or ham. Traditional methods represent an email in the vector space model, using feature selection measures such as information gain to extract a subset of terms; such representations capture little of the semantic information among words. This paper proposes a new method that represents email features by combining the traditional vector space model with a term co-occurrence model, and then trains a classifier with the cross covering algorithm. Experiments show that the proposed filtering algorithm improves filtering performance over traditional methods, and that the dimensions selected via term co-occurrence are more representative than those selected by the traditional approach.
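The record only summarizes the method, so as a rough illustration of the two ingredients the abstract names (information-gain feature selection over a bag-of-words vector space model, plus term co-occurrence features), here is a minimal Python sketch. All function names and the sliding-window definition of co-occurrence are assumptions for illustration, not taken from the paper; the paper's actual cross covering classifier is not reproduced here.

```python
import math
from collections import Counter

def information_gain(docs, labels, term):
    """Information gain of one term for a binary spam/ham labeling.
    docs: list of token lists; labels: parallel list of 0/1 class labels."""
    def entropy(lab):
        if not lab:
            return 0.0
        c = Counter(lab)
        return -sum(v / len(lab) * math.log2(v / len(lab)) for v in c.values())
    n = len(docs)
    present = [labels[i] for i, d in enumerate(docs) if term in d]
    absent = [labels[i] for i, d in enumerate(docs) if term not in d]
    return (entropy(labels)
            - len(present) / n * entropy(present)
            - len(absent) / n * entropy(absent))

def select_terms(docs, labels, k):
    """Keep the k terms with the highest information gain (the VSM dimensions)."""
    vocab = {t for d in docs for t in d}
    return sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)[:k]

def cooccurrence_pairs(doc, window=2):
    """Count unordered term pairs that co-occur within a small sliding window
    (one plausible reading of 'term co-occurrence'; the paper may define it differently)."""
    pairs = Counter()
    for i, t in enumerate(doc):
        for u in doc[i + 1:i + 1 + window]:
            if t != u:
                pairs[tuple(sorted((t, u)))] += 1
    return pairs

def email_vector(doc, terms, pair_index):
    """Concatenate plain term-frequency features with co-occurrence-pair counts."""
    tf = Counter(doc)
    co = cooccurrence_pairs(doc)
    return [tf[t] for t in terms] + [co[p] for p in pair_index]
```

A classifier is then trained on the concatenated vectors; the paper uses the cross covering algorithm for that step, but any standard learner would slot in at that point.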
Source: Journal of Chinese Information Processing (《中文信息学报》, CSCD, Peking University Core), 2009, No. 6: 61-66, 71 (7 pages).
Funding: National Basic Research Program of China (973 Program) (2004CB318108, 2007CB311003); National Natural Science Foundation of China (60675031); Ministry of Education Humanities and Social Sciences Youth Fund (07JC870006).
Keywords: computer application; Chinese information processing; vector space model; spam filter; term co-occurrence model; covering algorithm
  • Related Literature

References (14)

  • 1 W. W. Cohen. Fast effective rule induction [C]// Proceedings of the Twelfth International Conference on Machine Learning. Tahoe City, California, USA: Morgan Kaufmann, 1995: 115-123.
  • 2 X. Carreras, L. Marquez. Boosting trees for anti-spam email filtering [C]// Proceedings of the Euro Conference on Recent Advances in NLP (RANLP-2001). 2001: 58-64.
  • 3 Liu Yang, Du Xiaoping, Luo Ping, et al. Intelligent analysis and filtering of spam, with a discussion of rough sets [C]// Proceedings of the 12th China Computer Federation Conference on Networks and Data Communications. Wuhan, 2002.
  • 4 I. Androutsopoulos, G. Paliouras, V. Karkaletsis, et al. Learning to filter spam e-mail: A comparison of a Naive Bayesian and a memory-based approach [C]// Proceedings of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2000). 2000: 1-13.
  • 5 H. Drucker, D. Wu, V. N. Vapnik. Support vector machines for spam categorization [J]. IEEE Transactions on Neural Networks, 1999, 10(5): 1048-1054.
  • 6 M. Sahami, S. Dumais, D. Heckerman, et al. A Bayesian approach to filtering junk e-mail [C]// Proceedings of the AAAI Workshop on Learning for Text Categorization. 1998: 55-62.
  • 7 Liu Wuying, Wang Ting. Online spam filtering based on the ensemble learning of multiple filters [J]. Journal of Chinese Information Processing, 2008, 22(1): 67-73. (Cited by 4)
  • 8 H. J. Peat, P. Willett. The limitations of term co-occurrence data for query expansion in document retrieval systems [J]. JASIS, 1991, 42(5): 378-383.
  • 9 G. Salton, A. Wong, C. S. Yang. On the specification of term values in automatic indexing [J]. Journal of Documentation, 1973, 29(4): 351-372.
  • 10 Dai Liuling, Huang Heyan, Chen Zhaoxiong. A comparative study of feature extraction methods for Chinese text categorization [J]. Journal of Chinese Information Processing, 2004, 18(1): 26-32. (Cited by 228)

Secondary References (57)

  • 1 Li Yuqin, Sun Lihua. Application of rule-based automatic classification in text categorization [J]. Journal of Chinese Information Processing, 2004, 18(4): 9-14. (Cited by 20)
  • 2 Zhang Ling, Zhang Bo. FP learning and synthesis algorithms for multilayer feedback neural networks [J]. Journal of Software, 1997, 8(4): 252-258. (Cited by 24)
  • 3 Huang Changning, et al. Reflections on automatic word segmentation [C]// Language Computing and Content-Based Text Processing. Beijing: Tsinghua University Press, 2003: 26-38.
  • 4 M. DeSouza, J. Fitzgerald, C. Kemp, G. Truong. A decision tree based spam filtering agent [EB/OL]. http://www.cs.mu.oz.au/481/2001-projects/gntr/index.html, 2001.
  • 5 N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm [J]. Machine Learning, 1988, 2(4): 285-318.
  • 6 R. Krishnamurthy, C. Orasan. A corpus-based investigation of junk emails [C]// Proceedings of the Language Resources and Evaluation Conference (LREC 2002). Las Palmas de Gran Canaria, Spain, 2002: 1773-1780.
  • 7 M. Sahami, S. Dumais, D. Heckerman, E. Horvitz. A Bayesian approach to filtering junk e-mail [C]// Proceedings of the AAAI Workshop on Learning for Text Categorization. 1998: 55-62.
  • 8 W. Cohen. Fast effective rule induction [C]// Proceedings of the Twelfth International Conference on Machine Learning. Tahoe City, California: Morgan Kaufmann, 1995: 115-123.
  • 9 W. Cohen. Learning rules that classify e-mail [C]// Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access. Palo Alto, California, 1996: 18-25.
  • 10 X. Carreras, L. Marquez. Boosting trees for anti-spam email filtering [C]// Proceedings of the Euro Conference on Recent Advances in NLP (RANLP-2001). 2001: 58-64.

Co-Cited Literature (486)

Literature Co-Cited with This Paper (26)

  • 1 Wu Guangyuan, He Pilian, Cao Guihong, Nie Song. Research on term co-occurrence based on the vector space model and its application to text categorization [J]. Computer Applications, 2003, 23(z1): 138-140. (Cited by 23)
  • 2 Jin Yaohong. Text similarity computation based on context frameworks [J]. Computer Engineering and Applications, 2004, 40(16): 36-39. (Cited by 26)
  • 3 Duan Zhen, Lu Jie, Zhang Ling. License plate recognition based on cross covering neural networks [J]. Journal of Anhui University (Natural Science Edition), 2004, 28(5): 11-14. (Cited by 7)
  • 4 Shen Hong, Lü Baoliang, Masao Utiyama, Hitoshi Isahara. Comparison and improvement of feature extraction methods for text categorization [J]. Computer Simulation, 2006, 23(3): 222-224. (Cited by 28)
  • 5 Cao Shumei. Starting from multi-dimensional evaluation to promote the healthy development of quality-oriented education [J]. Youth Science, 2010, (01): 180.
  • 6 G. Salton, A. Wong, C. S. Yang. A vector space model for automatic indexing [J]. Communications of the ACM, 1975, 18(11): 613-620.
  • 7 R. Feldman, Y. Aumann, et al. Text mining at the term level [C]// Proceedings of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery. Nantes, France, 1998: 23-26.
  • 8 K. Hammouda, M. Kamel. Efficient phrase-based document indexing for Web document clustering [J]. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(10): 1279-1296.
  • 9 Thomas Hofmann. Unsupervised learning by probabilistic latent semantic analysis [J]. Machine Learning, 2001, 42: 177-196.
  • 10 Yunjae Jung, Haesun Park, Ding-Zhu Du, et al. A decision criterion for the optimal number of clusters in hierarchical clustering [J]. Journal of Global Optimization, 2003, 25: 91-111.

Citing Literature (4)

Secondary Citing Literature (8)
