
Semi-supervised Learning Optimization Algorithm for Communication Spam Text Recognition
Abstract Random under-sampling is often applied to unbalanced communication text to improve classifier performance, but the randomly drawn samples can yield biased estimates. To address this problem, a Down-Sampling algorithm based on Negative Selection Density Clustering (NSDC-DS) is proposed. First, the self/anomaly detection mechanism of the negative selection algorithm is used to improve traditional clustering: the sample center points serve as detectors, the samples to be clustered serve as the self set, and the two are matched for anomalies. Second, the negative selection density clustering algorithm is used to evaluate sample similarity and improve the traditional down-sampling method, and an NBSVM classifier identifies spam in the sampled communication text. Finally, PCA is used to evaluate the information content of the samples, and an improved PCA-SGD algorithm is proposed to tune the model parameters, completing the semi-supervised recognition of communication spam text. To verify the improved algorithms, experiments on several data sets, including unbalanced communication text, compare negative selection density clustering, NSDC-DS, and PCA-SGD with traditional models. The results show that the improved model recognizes communication spam text well and converges quickly and stably.
Authors QIU Ningjia (邱宁佳); SHEN Zhuorui (沈卓睿); WANG Hui (王辉); WANG Peng (王鹏) (School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022, China)
Source Computer Engineering and Applications (《计算机工程与应用》), CSCD, PKU Core Journal, 2020, No. 17, pp. 121-128 (8 pages)
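Funding Key Technology R&D Project of the Jilin Province Science and Technology Development Plan (No. 20190302118GX); "13th Five-Year" Science and Technology Project of the Jilin Provincial Department of Education (No. JJKH20190601KJ).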
Keywords unbalanced data; spam text recognition; negative selection density clustering; Down-Sampling algorithm based on Negative Selection Density Clustering (NSDC-DS); Stochastic Gradient Descent based on Principal Component Analysis (PCA-SGD) algorithm
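
The abstract names the motivation for NSDC-DS (random under-sampling can miss sparse regions of the majority class) but does not give the procedure itself. The following is only a minimal sketch of the general idea of cluster-aware down-sampling, not the authors' algorithm. It assumes a simple greedy clustering in which each center acts like a fixed-radius detector, loosely echoing the detector/self-set matching the abstract describes. The function names `leader_clusters` and `cluster_downsample`, the `radius` parameter, and the proportional quota rule are all hypothetical choices made for illustration.

```python
import math
import random


def leader_clusters(points, radius):
    """Greedy clustering (an illustrative stand-in, not NSDC).

    Each center behaves like a detector with a fixed radius: a point joins
    the first center it lies within `radius` of, and a point matched by no
    existing center becomes a new center.
    """
    centers, clusters = [], []
    for p in points:
        for i, c in enumerate(centers):
            if math.dist(p, c) <= radius:
                clusters[i].append(p)
                break
        else:
            # No detector matched this point, so it starts a new cluster.
            centers.append(p)
            clusters.append([p])
    return clusters


def cluster_downsample(majority, n_keep, radius, seed=0):
    """Keep roughly n_keep majority samples, drawn from every cluster.

    Each cluster contributes in proportion to its size, with at least one
    sample, so sparse regions stay represented after down-sampling.
    """
    rng = random.Random(seed)
    clusters = leader_clusters(majority, radius)
    total = len(majority)
    kept = []
    for members in clusters:
        quota = max(1, round(n_keep * len(members) / total))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept


# Toy majority class: one dense group near the origin, one sparse group far away.
dense = [(0.01 * i, 0.0) for i in range(90)]
sparse = [(10.0 + 0.01 * i, 10.0) for i in range(10)]
kept = cluster_downsample(dense + sparse, n_keep=20, radius=2.0)
print(len(kept), sum(1 for x, _ in kept if x >= 10.0))
```

Because of the one-sample minimum and rounding, the number of samples kept is only approximately `n_keep`. In a real pipeline the points would be text feature vectors (for example TF-IDF), and the kept majority samples would be combined with the minority class before training a classifier.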
