
Text categorization scheme based on semi-supervised learning and latent Dirichlet allocation model

Cited by: 7
Abstract: For the problem of text classification when the sample set contains few labeled samples, a labeled-sample extension method (SSL-LDA) combining semi-supervised learning (SSL) and the latent Dirichlet allocation (LDA) topic model was proposed, and a naive Bayes (NB) classifier was integrated to construct a text categorization method. The LDA topic model was used to generate a topic distribution to represent all samples. The optimal parameters of the SSL-LDA self-training model were obtained with a simplified particle swarm optimization (SPSO) algorithm, according to the labeled samples in the training set. The SSL-LDA self-training model was then used to label some unlabeled samples in the training set, expanding it. Finally, the NB text classifier was trained on the expanded training set. Experimental results on three datasets show that the proposed method copes well with few labeled samples and achieves high classification accuracy.
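The pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy corpus, `N_TOPICS`, and `CONF_THRESHOLD` are assumptions, a fixed confidence threshold stands in for the SPSO-tuned self-training parameters, and scikit-learn's `LatentDirichletAllocation` and `MultinomialNB` stand in for the paper's LDA and NB components.

```python
# Sketch of the SSL-LDA self-training pipeline (illustrative values throughout):
# 1) represent every document by its LDA topic distribution,
# 2) self-train an NB classifier, moving confidently predicted unlabeled
#    samples into the training set, 3) fit the final NB classifier.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus; label -1 marks an unlabeled sample (hypothetical data).
docs = [
    "cat dog pet animal", "dog pet animal fur",
    "stock market price trade", "market trade finance stock",
    "cat dog pet", "price finance market",
]
y = np.array([0, 0, 1, 1, -1, -1])

N_TOPICS = 2          # assumed topic count
CONF_THRESHOLD = 0.8  # fixed threshold replacing the paper's SPSO-tuned parameters

# Step 1: LDA topic distributions represent all samples (labeled and unlabeled).
counts = CountVectorizer().fit_transform(docs)
theta = LatentDirichletAllocation(
    n_components=N_TOPICS, random_state=0).fit_transform(counts)

# Step 2: self-training rounds expand the labeled training set.
for _ in range(5):
    labeled = y != -1
    clf = MultinomialNB().fit(theta[labeled], y[labeled])
    if labeled.all():
        break
    proba = clf.predict_proba(theta[~labeled])
    conf = proba.max(axis=1)
    if (conf < CONF_THRESHOLD).all():
        break  # nothing confident enough to pseudo-label
    idx = np.where(~labeled)[0][conf >= CONF_THRESHOLD]
    y[idx] = clf.predict(theta[idx])

# Step 3: final NB classifier trained on the expanded training set.
final_clf = MultinomialNB().fit(theta[y != -1], y[y != -1])
```

Topic proportions are non-negative, so they are valid `MultinomialNB` features; on a real corpus the confidence threshold and number of rounds would be the parameters the paper tunes with SPSO.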
Authors: HAN Dong; WANG Chun-hua; XIAO Min (School of Information Engineering, Huanghuai University, Zhumadian 463000, China; School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430063, China)
Source: Computer Engineering and Design (《计算机工程与设计》), PKU Core Journal, 2018, Issue 10, pp. 3265-3271 (7 pages)
Funding: Science and Technology Program of the Henan Provincial Department of Science and Technology (172102210117); Science and Technology Program of Zhumadian City, Henan Province (17135)
Keywords: text categorization; semi-supervised learning; LDA topic model; simplified particle swarm optimization; labeled sample extension