Abstract
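[Objective] To effectively determine the optimal number of topics for the LDA topic model in scientific and technical information analysis. [Methods] Topic similarity is used to measure the differences among latent topics and is combined with perplexity to give a method for determining the optimal number of LDA topics; the method accounts for both the quality of topic extraction and the model's ability to generalize to new documents. [Results] Scientific and technical literature from China's new energy field was collected as the dataset. The empirical results show that, compared with using perplexity alone, the proposed method achieves higher topic extraction precision (91.67%), a higher F-score (86.27%), and higher literature recommendation precision (71.25%). [Limitations] The new method was not validated on other types of datasets, such as microblog short texts and XML documents. [Conclusions] The proposed method can effectively extract highly distinguishable topics from scientific and technical literature datasets and can improve literature recommendation performance.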
[Objective] This paper aims to identify the optimal number of topics for the Latent Dirichlet Allocation (LDA) model in the analysis of scientific and technical information. [Methods] First, we used topic similarity to measure the differences among latent topics. Second, we proposed a method for determining the optimal number of topics and applied it to Chinese literature in the field of new energy. [Results] The proposed method achieved a higher precision ratio and a higher F-score in topic extraction, and improved the performance of literature recommendation systems. [Limitations] We did not examine the new method with other datasets, such as microblog posts and XML documents. [Conclusions] The proposed method could identify more recognizable topics and improve the performance of scientific and technical literature recommendation systems.
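The idea of combining topic similarity with perplexity can be illustrated with a minimal sketch. The abstract does not give the paper's exact formula, so everything specific here is an assumption made for illustration: cosine similarity as the topic similarity measure, the product of perplexity and mean pairwise similarity as the combined score, the function names, and the hand-made toy inputs. In practice the perplexity values and topic-word distributions would come from LDA models trained for each candidate number of topics.

```python
import math
from itertools import combinations


def cosine(p, q):
    """Cosine similarity between two topic-word probability vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def mean_topic_similarity(topics):
    """Average cosine similarity over all topic pairs (lower = more distinct)."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        raise ValueError("need at least two topics to compare")
    return sum(cosine(p, q) for p, q in pairs) / len(pairs)


def select_topic_number(candidates):
    """Return the K with the lowest combined score, plus all scores.

    `candidates` maps K -> (held-out perplexity, list of topic-word vectors).
    The score perplexity * mean similarity is an illustrative assumption,
    not the formula from the paper.
    """
    scores = {
        k: perplexity * mean_topic_similarity(topics)
        for k, (perplexity, topics) in candidates.items()
    }
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Toy inputs invented for this sketch (not data from the paper).
candidates = {
    2: (120.0, [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]),
    3: (100.0, [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]),
    4: (95.0, [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8],
               [0.7, 0.2, 0.1]]),
}
best_k, scores = select_topic_number(candidates)
print(best_k, scores)
```

In the toy data, K = 4 has the lowest perplexity but includes a near-duplicate topic, so a score that also penalizes similar topics prefers a smaller K. This mirrors the abstract's point that perplexity alone does not capture how distinguishable the extracted topics are.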
Authors
Guan Peng (关鹏)
Wang Yuefen (王曰芬)
Guan Peng, Wang Yuefen (School of Economics and Management, Nanjing University of Science & Technology, Nanjing 210094, China; College of Applied Mathematics, Chaohu University, Hefei 238000, China)
Source
《现代图书情报技术》
CSSCI
2016, No. 9, pp. 42-50 (9 pages)
New Technology of Library and Information Service
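Funding
National Natural Science Foundation of China project "Growth of Scientific Literature Dissemination Networks in New Research Fields and Its Influence on Dissemination Effects" (Grant No. 71373124)
National Social Science Fund of China key project "Research on a Methodological System for Public Opinion and Decision Support in the Big Data Environment" (Grant No. 14AZD084)
One of the research outputs of the Jiangsu Universities Key Research Base of Philosophy and Social Sciences (cultivation site) "Social Computing and Public Opinion Analysis"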
Keywords
LDA
Topic model
Similarity
Perplexity
Analysis of Scientific and Technical Information