摘要
【目的】为提升传统LDA模型的主题识别性能,并给主题最优数目选择提供技术方案,提出基于自适应聚类的K-wrLDA模型。【方法】利用LDA和Word2Vec模型得出包含主题词概率信息及词义相关性的T-WV矩阵,并将传统LDA模型的主题数目选择问题转化为聚类效果评价问题,以内部指标伪F统计量作为目标函数,计算主题聚类数目的最优解,并对新旧两种模型的主题识别效果进行比较。【结果】经自适应聚类得出最优主题数量为33,且新模型的困惑度得分始终低于传统模型,主题识别效果对比显示新模型具有更好的凝聚性。【局限】在实证语料选取上获取单一主题下的科技文献,数据量不大。【结论】新模型具有更理想的主题识别能力,并能够自主计算最优主题数目。该模型作为对传统LDA模型的改进,可以应用于各领域的大规模语料中。
[Objective] This paper proposes a K-wrLDA model based on adaptive clustering, aiming to improve the subject recognition ability of traditional LDA model, and identify the optimal number of selected topics. [Methods] First, we used the LDA and word2 vec models to construct the T-WV matrix containing the probability information and the semantic relevance of the subject words. Then, we selected the number of topics based on the evaluation of clustering effects and the pseudo-F statistic. Finally, we compared the topic identification results of the proposed model with the old ones. [Results] The optimal number of topics was 33 for the proposed model, which also has lower level of perplexity than the traditional ones. [Limitations] The sample size needs to be expanded. [Conclusions] The proposed model, which has better recognition rate than the traditional LDA model, could also calculate the optimal number of topics. The new model may be applied to process large corpus in various fields.
出处
《数据分析与知识发现》
CSSCI
CSCD
北大核心
2018年第1期29-40,共12页
Data Analysis and Knowledge Discovery
基金
国家社会科学基金项目"基于LDA模型的‘海上丝绸之路’文本挖掘研究"(项目编号:15CTJ005)的研究成果之一
关键词
主题模型
词嵌入
自适应聚类
困惑度
Topic Model
Word Embedding
Adaptive Clustering
Perplexity