Abstract
To address text classification when only a small number of labeled samples are available, a labeled-sample extension method (SSL-LDA) is proposed that combines semi-supervised learning (SSL) with the latent Dirichlet allocation (LDA) topic model, and it is integrated with a naive Bayes (NB) classifier to build a text classification method. The LDA topic model generates topic distributions that represent all samples. Using the labeled samples in the training set, a simplified particle swarm optimization (SPSO) algorithm finds the optimal parameters of the SSL-LDA self-training model. The SSL-LDA self-training model then labels some of the unlabeled samples in the training set, expanding the training set, and the NB text classifier is trained on the expanded set. Experimental results on three datasets show that the method copes well with scarce labeled samples and achieves high classification accuracy.
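The label-expansion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' SSL-LDA procedure, whose details the abstract does not give. It assumes that per-document topic distributions have already been produced by an LDA model, and it swaps in a simple centroid rule: an unlabeled document receives the label of the nearest class centroid when the cosine similarity reaches a threshold. The names `cosine`, `class_centroids`, `expand_labeled_set`, and the `threshold` parameter are hypothetical; `threshold` stands in for the kind of parameter that SPSO would tune, and the topic vectors are made-up illustrative values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def class_centroids(vectors, labels):
    """Mean topic distribution of the labeled samples in each class."""
    sums, counts = {}, {}
    for vec, lab in zip(vectors, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}


def expand_labeled_set(labeled_vecs, labels, unlabeled_vecs, threshold=0.9):
    """Label confident unlabeled samples and return the expanded set.

    `threshold` is a hypothetical confidence cut-off, not a parameter
    taken from the paper.
    """
    centroids = class_centroids(labeled_vecs, labels)
    new_vecs, new_labels = list(labeled_vecs), list(labels)
    remaining = []
    for vec in unlabeled_vecs:
        # Pick the class whose centroid is most similar to this sample.
        best_label, best_sim = max(
            ((lab, cosine(vec, c)) for lab, c in centroids.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            new_vecs.append(vec)
            new_labels.append(best_label)
        else:
            remaining.append(vec)  # not confident enough, stays unlabeled
    return new_vecs, new_labels, remaining


# Illustrative topic distributions over 3 topics (made-up values).
labeled = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.1, 0.7]]
labels = ["sports", "sports", "finance", "finance"]
unlabeled = [[0.75, 0.15, 0.10], [0.1, 0.2, 0.7], [0.34, 0.33, 0.33]]

expanded_vecs, expanded_labels, still_unlabeled = expand_labeled_set(
    labeled, labels, unlabeled
)
```

In a full pipeline along the lines of the abstract, the expanded set would then be used to train the NB classifier. With the data above, the near-uniform third document falls below the threshold and stays unlabeled.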
Authors
HAN Dong (韩栋)
WANG Chun-hua (王春华)
XIAO Min (肖敏)
Affiliations: School of Information Engineering, Huanghuai University, Zhumadian 463000, China; School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430063, China
Source
Computer Engineering and Design (《计算机工程与设计》)
Peking University Core Journal (北大核心)
2018, No. 10, pp. 3265-3271 (7 pages)
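Funding
Science and Technology Plan Project of the Henan Provincial Department of Science and Technology (172102210117)
Science and Technology Plan Project of Zhumadian City, Henan Province (17135)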
Keywords
text categorization
semi-supervised learning
latent Dirichlet allocation (LDA) topic model
simplified particle swarm optimization
labeled sample extension