Time-Weighted TF-LDA Topic Analysis of Academic Literature Abstracts (cited by: 4)

A Thematic Analysis Method of Academic Documents Based on TF-IDF and LDA
Abstract: With the growth of the Internet, topic extraction is being applied ever more widely, particularly to academic literature. Although abstracts of academic papers are short texts, their high dimensionality makes them difficult for text topic models to handle, and their time sensitivity means the time factor is easily overlooked during topic mining, which leads to uneven and unclear topic distributions. To address these problems, a topic clustering model for academic literature abstracts based on TTF-LDA (time + TF-IDF + latent Dirichlet allocation) is proposed. TF-IDF feature extraction is used to extract feature words from the abstracts, which effectively reduces the dimensionality of the text fed to the LDA model. The publication time of each paper is also incorporated: a time window is established to bound the period covered by the topic analysis, and feature words are given time weights according to the publication time of the documents in which they appear. The sum of a feature word's time weights, together with a topic-guided feature-word lexicon, serves as an influence factor for LDA. In experiments on a dataset collected by a web crawler, and with the same number of topics selected, the model shows a lower degree of confusion than standard LDA and MVC-LDA, and its topics are more clearly distinguished from one another, which better matches the characteristics of academic literature.
Authors: WU Zhe; YANG Fang (School of Computer Science, Xi'an University of Posts and Telecommunications, Xi'an 710121, China)
Source: Computer Technology and Development (《计算机技术与发展》), 2020, No. 1, pp. 194-200 (7 pages)
Funding: Special Scientific Research Program of the Shaanxi Provincial Department of Education (15JK1679); Xi'an Science and Technology Innovation Guidance Project (201805040YD18CG24(7))
Keywords: LDA; topic model; academic literature; TF-IDF; time factor
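
The two preprocessing ideas named in the abstract, TF-IDF feature-word extraction and time weighting inside a time window, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not give the weighting formula, so the exponential decay in `time_weight`, the `decay` parameter, and all function names are assumptions made for the example. The LDA step itself is left out; the reduced documents and summed weights are what would be handed to an LDA implementation.

```python
import math
from collections import Counter


def tfidf_top_terms(docs, k):
    """Keep the k highest-scoring TF-IDF terms of each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    reduced = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        top = sorted(scores, key=lambda t: (-scores[t], t))[:k]
        reduced.append(top)
    return reduced


def time_weight(year, window_start, window_end, decay=0.5):
    """Hypothetical time weight (an assumption, not the paper's formula):
    zero outside the time window, exponentially decaying with age inside it."""
    if not window_start <= year <= window_end:
        return 0.0
    return math.exp(-decay * (window_end - year))


def term_time_weights(reduced_docs, years, window_start, window_end, decay=0.5):
    """Sum, for each feature word, the time weights of the documents it occurs in."""
    totals = Counter()
    for terms, year in zip(reduced_docs, years):
        w = time_weight(year, window_start, window_end, decay)
        if w == 0.0:
            continue  # document falls outside the time window
        for term in terms:
            totals[term] += w
    return dict(totals)


if __name__ == "__main__":
    # Toy tokenized abstracts with publication years (made-up data).
    docs = [
        ["topic", "model", "lda", "text"],
        ["topic", "time", "window", "lda"],
        ["neural", "network", "text"],
    ]
    years = [2019, 2017, 2010]
    reduced = tfidf_top_terms(docs, k=2)
    print(reduced)
    print(term_time_weights(reduced, years, window_start=2015, window_end=2019))
```

Keeping only the top `k` terms per abstract is what shrinks the LDA input vocabulary, while documents outside the window contribute nothing to the summed weights. How those sums enter LDA as an influence factor is specified in the paper itself, not here.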