
Optimization of the LDA Model and Selection of the Number of Topics: The Case of Scientific and Technical Literature (cited by: 35)

Optimizing LDA Model with Various Topic Numbers: Case Study of Scientific Literature
Abstract [Objective] To improve the topic identification performance of the traditional LDA model and to provide a technical approach for selecting the optimal number of topics, this paper proposes the K-wrLDA model, which is based on adaptive clustering. [Methods] The LDA and Word2Vec models are used to build a T-WV matrix that contains both the probability information of the topic words and their semantic relatedness. The problem of choosing the number of topics in traditional LDA is thereby recast as a problem of evaluating clustering quality: with the pseudo-F statistic, an internal validity index, as the objective function, the optimal number of topic clusters is computed, and the topic identification results of the new and traditional models are compared. [Results] Adaptive clustering gave an optimal number of topics of 33. The perplexity of the new model was consistently lower than that of the traditional model, and the comparison of topic identification results showed that the topics of the new model are more cohesive. [Limitations] The empirical corpus consists of scientific literature on a single subject, and the data volume is small. [Conclusions] The new model identifies topics better and can compute the optimal number of topics on its own. As an improvement on the traditional LDA model, it can be applied to large-scale corpora in various fields.
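The idea of treating topic-number selection as a clustering-evaluation problem can be illustrated with a minimal sketch. It is not the authors' K-wrLDA implementation: the abstract does not say which clustering algorithm is used, so plain k-means with a farthest-point initialization is assumed here, and the toy 2-D points merely stand in for rows of a T-WV matrix. The pseudo-F statistic is computed in its usual form, the between-cluster sum of squares divided by (k - 1) over the within-cluster sum of squares divided by (n - k). All function names (`kmeans`, `pseudo_f`, `choose_k`) are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means (an assumed stand-in for the paper's clustering step)."""
    rng = random.Random(seed)
    # Farthest-point initialization: start from one random point, then
    # repeatedly add the point farthest from all centers chosen so far.
    centers = [points[rng.randrange(len(points))]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = mean(members)
    return labels

def pseudo_f(points, labels):
    """Pseudo-F statistic: (between SS / (k - 1)) / (within SS / (n - k))."""
    n = len(points)
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("pseudo-F needs 2 <= k < n non-empty clusters")
    grand = mean(points)
    between = within = 0.0
    for members in clusters.values():
        center = mean(members)
        between += len(members) * sq_dist(center, grand)
        within += sum(sq_dist(p, center) for p in members)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))

def choose_k(points, k_values):
    """Return the candidate k with the largest pseudo-F, plus all scores."""
    scores = {k: pseudo_f(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores

# Toy data (hypothetical): three tight groups standing in for T-WV rows.
offsets = [(0, 0), (0.5, 0), (0, 0.5), (-0.5, 0), (0, -0.5)]
toy_points = [[cx + dx, cy + dy]
              for cx, cy in [(0, 0), (10, 0), (0, 10)]
              for dx, dy in offsets]

if __name__ == "__main__":
    best_k, scores = choose_k(toy_points, range(2, 7))
    print("selected k:", best_k)
    for k, s in sorted(scores.items()):
        print(k, round(s, 2))
```

In a real pipeline the rows would come from trained LDA and Word2Vec models, and the candidate range for k would be chosen to bracket the expected number of topics.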
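Source: Data Analysis and Knowledge Discovery (《数据分析与知识发现》), indexed in CSSCI, CSCD, and the PKU Core Journals list; 2018, No. 1, pp. 29-40 (12 pages)
Funding: An outcome of the National Social Science Fund of China project "Text Mining Research on the 'Maritime Silk Road' Based on the LDA Model" (Grant No. 15CTJ005)
Keywords: Topic Model; Word Embedding; Adaptive Clustering; Perplexity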
