Abstract
[Purpose/significance] Topic identification in scientific literature is an important part of research management. How to make full use of the multivariate data of a document and improve the performance of automatic topic identification is a question worth studying. [Method/process] Keywords and abstracts are important evidence for judging a document's topic. This paper proposes a topic identification model based on the fusion of multivariate document data: Word2vec, AP clustering, and Node2vec are used to represent topic vectors at the keyword layer, LDA is used to represent topic vectors at the abstract layer, and the SGF method from multi-view clustering is used to fuse the data and identify document topics. [Result/conclusion] Topic identification experiments on document sets of different scales verify that the model's accuracy and interpretability are better than those of the typical LDA method and the Doc-LDA model.
Authors
邱均平
孙月瑞
周贞云
Qiu Junping; Sun Yuerui; Zhou Zhenyun (Chinese Academy of Science and Education Evaluation, Hangzhou Dianzi University, Zhejiang, 310018; School of Management, Hangzhou Dianzi University, Zhejiang, 310018; Academy of Data Science and Informatics, Hangzhou Dianzi University, Zhejiang, 310018)
Source
《情报资料工作》
CSSCI
Peking University Core Journals (北大核心)
2022, Issue 6, pp. 14-20 (7 pages)
Information and Documentation Services
Funding
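This article is one of the research outputs of the following projects:
Major Project of the National Social Science Fund of China (2019), "Construction of a Big-Data-Based Information Cloud Platform for Science and Education Evaluation and Research on Intelligent Services" (Grant No. 19ZDA348)
Key Project of the Zhejiang Provincial Soft Science Research Program (2020), "Evaluation and Enhancement of the Science and Technology Innovation Competitiveness of Zhejiang Universities in the Context of Building an Innovation-Strong Province" (Grant No. 2020C25027)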
Keywords
scientific literature
topic identification
data fusion
multi-view clustering
multivariate data