Design and Implementation of a Spam Filtering System Based on a Topic Model (cited by: 3)
Abstract: Spam filtering plays an important role in safeguarding information security, improving resource utilization, and sorting information. Spam degrades the user experience and causes unnecessary losses of time and money. To address the shortcomings of existing spam filtering techniques, a naive Bayes spam classification method based on multiple topic words is proposed. To obtain email topics, the LDA topic model is used to extract the topics of each email and their topic words; Word2Vec is then used to find synonyms and related words of those topic words, extending the topic-word set. For classification, word prior probabilities are learned statistically from the training dataset; from the extended topic-word set and its probabilities, the joint probability of a topic and an email is derived via Bayes' formula and used as the basis for judging whether the email is spam. The resulting topic-model-based spam filtering system is also simple and easy to apply. Comparative experiments against other typical spam filtering methods show that both the topic-model-based classification method and its Word2Vec-based improvement effectively improve spam filtering accuracy.
Authors: KOU Xiaohuai (寇晓淮), CHENG Hua (程华)
Source: Telecommunications Science (《电信科学》), PKU Core Journal, 2017, No. 11, pp. 73-82 (10 pages)
Keywords: text classification, spam, topic model, Bayesian theory
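
The classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the topic words and their expansions are hard-coded stand-ins for what LDA and Word2Vec would produce, the toy corpus is invented, and the decision rule is a plain naive Bayes log-odds score rather than the topic-email joint probability the paper derives. All names (`train_word_probs`, `spam_log_odds`, `TOPIC_WORDS`, `EXPANSIONS`) are hypothetical.

```python
import math
from collections import Counter

# Assumption: stand-in for topic words that an LDA model would extract.
TOPIC_WORDS = {"free", "prize", "meeting"}
# Assumption: stand-in for synonyms/related words that Word2Vec would return.
EXPANSIONS = {"free": ["complimentary"], "prize": ["reward"]}
EXTENDED_WORDS = TOPIC_WORDS | {w for ws in EXPANSIONS.values() for w in ws}

# Invented toy training data: each email is a list of tokens.
SPAM_DOCS = [["free", "prize", "click"],
             ["complimentary", "reward", "now"],
             ["free", "reward"]]
HAM_DOCS = [["meeting", "agenda", "monday"],
            ["project", "meeting", "notes"],
            ["free", "lunch", "meeting"]]


def train_word_probs(docs, vocab, alpha=1.0):
    """Estimate P(word | class) over a fixed vocabulary with Laplace smoothing."""
    counts = Counter(w for doc in docs for w in doc if w in vocab)
    total = sum(counts.values())
    return {w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}


def spam_log_odds(message, spam_probs, ham_probs, prior_spam=0.5):
    """Return log P(spam | message) - log P(ham | message), up to the shared
    normalizer, using only words that belong to the topic-word vocabulary."""
    score = math.log(prior_spam) - math.log(1.0 - prior_spam)
    for w in message:
        if w in spam_probs:
            score += math.log(spam_probs[w]) - math.log(ham_probs[w])
    return score


# Models trained on the base topic words and on the extended set.
base_spam = train_word_probs(SPAM_DOCS, TOPIC_WORDS)
base_ham = train_word_probs(HAM_DOCS, TOPIC_WORDS)
ext_spam = train_word_probs(SPAM_DOCS, EXTENDED_WORDS)
ext_ham = train_word_probs(HAM_DOCS, EXTENDED_WORDS)

message = ["claim", "your", "complimentary", "reward"]
print("base set score:    ", spam_log_odds(message, base_spam, base_ham))
print("extended set score:", spam_log_odds(message, ext_spam, ext_ham))
```

A positive score leans toward spam and a negative score toward legitimate mail. In this toy setup the message contains none of the base topic words, so the base model has no evidence either way, while the extended set lets the related words contribute to the score.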