
Identifying Optimal Topic Numbers from Sci-Tech Information with LDA Model

Cited by: 124
Abstract [Objective] This paper aims to determine the optimal number of topics for the Latent Dirichlet Allocation (LDA) topic model in the analysis of scientific and technical information. [Methods] We use topic similarity to measure the differences among latent topics and combine it with perplexity to propose a method for determining the optimal number of LDA topics. The method accounts for both the quality of topic extraction and the model's ability to generalize to new documents. [Results] Using Chinese scientific and technical literature in the field of new energy as the dataset, the empirical results show that, compared with using perplexity alone, the proposed method achieves higher topic-extraction precision (91.67%), a higher F-score (86.27%), and higher literature recommendation precision (71.25%). [Limitations] We did not validate the new method on other types of datasets, such as microblog short texts and XML documents. [Conclusions] The proposed method can effectively extract highly distinguishable topics from scientific and technical literature datasets and can improve the performance of literature recommendation.
Authors Guan Peng (关鹏), Wang Yuefen (王曰芬) (School of Economics and Management, Nanjing University of Science & Technology, Nanjing 210094, China; College of Applied Mathematics, Chaohu University, Hefei 238000, China)
Source New Technology of Library and Information Service (《现代图书情报技术》), CSSCI, 2016, No. 9, pp. 42-50 (9 pages)
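Funding This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on the Growth of Scientific Literature Dissemination Networks in New Research Fields and Its Influence on Dissemination Effects" (No. 71373124), the National Social Science Fund of China key project "Research on Social Public Opinion and Decision Support Method Systems in the Big Data Environment" (No. 14AZD084), and the Jiangsu Universities Key Research Base of Philosophy and Social Sciences (cultivation site) "Social Computing and Public Opinion Analysis".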
Keywords LDA topic model; Similarity; Perplexity; Analysis of scientific and technical information
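
The abstract describes the approach only at a high level: topic similarity is combined with perplexity to choose the number of topics. The following is a minimal sketch of that general idea, not the formula from the paper. It assumes Jensen-Shannon divergence as the topic-difference measure and a hypothetical score `perplexity / mean divergence` (lower is better); all function names are illustrative, and the model parameters (`theta`, `phi`) are assumed to come from any already-fitted LDA implementation.

```python
import math
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean_topic_divergence(topics):
    """Average pairwise JS divergence over topic-word distributions (needs >= 2 topics)."""
    pairs = list(combinations(topics, 2))
    return sum(js_divergence(p, q) for p, q in pairs) / len(pairs)


def perplexity(docs, theta, phi):
    """Held-out perplexity: exp(-sum of log p(w|d) / total token count).

    docs[d] maps word index -> count, theta[d][k] is the document-topic
    weight, and phi[k][w] is the topic-word probability.
    """
    log_lik, n_tokens = 0.0, 0
    for d, counts in enumerate(docs):
        for w, c in counts.items():
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += c * math.log(p_w)
            n_tokens += c
    return math.exp(-log_lik / n_tokens)


def combined_score(ppl, divergence):
    """Assumed criterion: lower perplexity and more distinct topics both lower the score."""
    return ppl / divergence


def pick_topic_number(stats):
    """stats maps K -> (perplexity, mean divergence); return the K with the lowest score."""
    return min(stats, key=lambda k: combined_score(*stats[k]))
```

In use, one would fit a model for each candidate number of topics, compute both quantities on held-out documents, and pass the resulting dictionary to `pick_topic_number`. Dividing by the divergence is only one way to trade generalization against topic distinctness; a weighted sum or a rank-based combination would be equally plausible under the abstract's description.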

