Performance and Selection of Chinese Spam Classification Model Under Different Lengths
Chinese title: 不同长度下中文垃圾邮件分类模型的研究 (A Study of Chinese Spam Classification Models Under Different Lengths). Cited by: 1
Abstract: To address the growing spam problem, this paper compares Chinese spam classification models across texts of different lengths using a range of algorithms. First, a naive Bayes classifier is trained and tested on the email dataset. Next, three subsets with different text lengths and two subsets with different sample sizes are drawn from the email dataset, giving five experimental sample sets. Finally, several traditional machine learning models, neural network models, and pre-trained models are built and compared on these five sets. The results show that the pre-trained model ALBERT is best suited to sentence-length Chinese spam, the traditional machine learning model SVM to paragraph-length Chinese spam, and the neural network model TextRCNN to document-length Chinese spam. The results also show that the neural network model TextRNN and the pre-trained model RoBERTa are not suitable for small-sample data.
Authors: Gu Mengjun (顾孟钧); Feng Wenzhou (冯文舟); Chen Zhongbing (陈中兵) (China Telecom Zhejiang Branch, Hangzhou, Zhejiang 310000; Public Security Bureau of Linhai City, Taizhou, Zhejiang 318000; Zhejiang Public Information Industry Co., Ltd., Hangzhou, Zhejiang 310000)
Source: Industry Information Security (《工业信息安全》), 2022, No. 7, pp. 28-35 (8 pages)
Keywords: Chinese spam; text classification; machine learning; deep learning
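The abstract's first two steps (a naive Bayes baseline, then splitting the data by text length) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the length boundaries, the features, or the tokenization, so the character-count cut-offs `SENTENCE_MAX` and `PARAGRAPH_MAX`, the character-level features, and the toy emails below are all assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical character-count cut-offs; the abstract does not state the
# boundaries between sentence, paragraph, and document length.
SENTENCE_MAX = 50
PARAGRAPH_MAX = 500


def length_bucket(text):
    """Assign a text to one of the three length classes named in the abstract."""
    n = len(text)
    if n <= SENTENCE_MAX:
        return "sentence"
    if n <= PARAGRAPH_MAX:
        return "paragraph"
    return "document"


def split_by_length(samples):
    """Group (text, label) pairs into per-length datasets."""
    buckets = defaultdict(list)
    for text, label in samples:
        buckets[length_bucket(text)].append((text, label))
    return buckets


class CharNaiveBayes:
    """Multinomial naive Bayes over single characters with add-one smoothing."""

    def fit(self, samples):
        self.class_counts = Counter()
        self.char_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in samples:
            self.class_counts[label] += 1
            self.char_counts[label].update(text)
            self.vocab.update(text)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.char_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for ch in text:
                if ch in self.vocab:  # ignore characters unseen in training
                    score += math.log((counts[ch] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up examples purely for illustration (not from the paper's dataset).
train = [
    ("免费领取大奖", "spam"),
    ("点击链接赢取现金", "spam"),
    ("明天上午开会", "ham"),
    ("请查收项目报告", "ham"),
]
model = CharNaiveBayes().fit(train)
buckets = split_by_length(train)
```

In a real experiment each length bucket would be split into training and test sets and every candidate model evaluated per bucket; the sketch only shows the bucketing and a baseline classifier.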