A spam filtering algorithm based on cost-sensitive learning for multiple topics (cited by: 2)

Abstract: To address the limited description and analysis of email topics in traditional email classification models, this paper proposes an email classification algorithm based on cost-sensitive multi-topic learning for spam filtering. First, LDA (latent Dirichlet allocation) extracts multiple topics from each email to describe its semantics. Second, CS-SVM (cost-sensitive support vector machine) performs cost-sensitive learning, applying different penalties to different classes of email. Finally, CS-SVM is combined with MI-SVM (multiple-instance support vector machine) to perform cost-sensitive multi-topic learning and classify the emails. The experiments use four preprocessed Ling-Spam sub-datasets. The results show that, compared with traditional email classification algorithms, the proposed algorithm achieves higher accuracy, specificity, and sensitivity.
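The three steps in the abstract can be illustrated with a small sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not confirm: each email is treated as a bag and each topic as one instance, hand-written 2-D vectors stand in for LDA-derived topic features, the class costs are arbitrary, and the linear SVM is trained by plain subgradient descent. All function and variable names are hypothetical.

```python
# Sketch only: toy 2-D vectors stand in for LDA-derived topic features.
# Labels: +1 = spam, -1 = legitimate. All names and values are hypothetical.

def score(w, b, x):
    """Linear decision value w.x + b for one instance (topic vector)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_cs_svm(X, y, cost, lam=0.001, lr=0.05, epochs=1000):
    """Linear SVM via subgradient descent on a class-weighted hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * score(w, b, xi) < 1.0    # margin check
            w = [wj * (1.0 - lr * lam) for wj in w]  # L2 shrinkage
            if violated:
                step = lr * cost[yi] * yi            # class-dependent penalty
                w = [wj + step * xj for wj, xj in zip(w, xi)]
                b += step
    return w, b

def bag_score(w, b, bag):
    """An email (bag) is scored by its highest-scoring topic (instance)."""
    return max(score(w, b, x) for x in bag)

def train_mi_svm(pos_bags, neg_bags, cost, rounds=10):
    """MI-SVM-style alternation: pick one witness per positive bag, retrain."""
    # Start from the centroid of each positive bag.
    witnesses = [[sum(col) / len(bag) for col in zip(*bag)] for bag in pos_bags]
    neg_instances = [x for bag in neg_bags for x in bag]
    for _ in range(rounds):
        X = witnesses + neg_instances
        y = [1] * len(witnesses) + [-1] * len(neg_instances)
        w, b = train_cs_svm(X, y, cost)
        # Re-select the highest-scoring instance of every positive bag.
        selected = [max(bag, key=lambda x: score(w, b, x)) for bag in pos_bags]
        if selected == witnesses:
            break
        witnesses = selected
    return w, b

# Toy data: spam emails contain at least one "spam-like" topic vector.
POS_BAGS = [
    [[0.9, 0.1], [0.8, 0.2], [0.1, 0.8]],
    [[1.0, 0.0], [0.7, 0.3], [0.15, 0.85]],
]
NEG_BAGS = [
    [[0.1, 0.9], [0.0, 1.0]],
    [[0.2, 0.8], [0.1, 0.7]],
]
# Assumed costs: misclassifying legitimate mail is penalised twice as much.
COST = {1: 1.0, -1: 2.0}

w, b = train_mi_svm(POS_BAGS, NEG_BAGS, COST)
new_email = [[0.05, 0.95], [0.95, 0.05]]
print("spam" if bag_score(w, b, new_email) > 0 else "legitimate")
```

Scoring a bag by its maximum instance follows the multiple-instance convention that an email counts as spam if at least one of its topics looks like spam. A higher cost on the legitimate class reflects the usual preference for avoiding false positives, though the abstract does not state which class the paper penalizes more.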
Authors: Zhang Shaocheng; Liu Wei; Cheng Ziao; Wang Danhua (Informatization Center, Liaoning University, Shenyang 110036, China; Information Network Center, Shenyang Jianzhu University, Shenyang 110168, China)
Source: Journal of Huazhong University of Science and Technology (Natural Science Edition), 2016, Issue S1, pp. 176-180 (5 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (61502092)
Keywords: latent Dirichlet allocation; support vector machine; spam filtering; text classification; multiple-instance learning