
Chinese Keyword Extraction Method Combining Knowledge Graph and Pre-training Model
Cited by: 2
Abstract: Keywords capture the theme of a text and condense its concepts and content. Through keywords, readers can quickly grasp the gist of a document, which improves the efficiency of information retrieval; keyword extraction also supports automatic summarization and text classification. In recent years, automatic keyword extraction has attracted wide attention, but extracting keywords accurately remains a challenge. On the one hand, keywords reflect subjective understanding, so judging whether a word is a keyword is itself subjective. On the other hand, Chinese words often carry rich semantic information, and it is difficult to distill the main idea of a text by relying solely on traditional statistical and topic features. To address the low accuracy, information redundancy, and missing information in Chinese keyword extraction, this paper proposes an unsupervised keyword extraction method that combines a knowledge graph with a pre-trained model. First, the pre-trained model is used for topic clustering, with a sentence-level clustering method that ensures the selected keywords cover the content of the whole text. Meanwhile, the knowledge graph is used for entity linking to achieve accurate word segmentation and disambiguation. Next, a semantic word graph is constructed from the topic information and used to compute semantic weights between words. Finally, keywords are ranked with a weighted PageRank algorithm. Comparative experiments are conducted on two public datasets, DUC 2001 and CSL, and on a separately annotated CLTS dataset, using precision, recall, and F1 score as metrics. The results show that the proposed method improves precision over multiple baseline methods; on the CLTS dataset, its F1 score is 9.14% higher than that of the traditional statistical method TF-IDF and 4.82% higher than that of the traditional graph-based method TextRank.
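The final ranking step can be illustrated with a minimal sketch of weighted PageRank over a word graph. This is not the authors' implementation: the function name `weighted_pagerank`, the damping factor of 0.85, the treatment of the graph as undirected, and the toy edge weights are all assumptions made for illustration. In the paper the edge weights come from topic-informed semantic similarity, which is not reproduced here.

```python
from collections import defaultdict


def weighted_pagerank(edges, damping=0.85, max_iter=100, tol=1e-8):
    """Score the nodes of an undirected weighted graph with weighted PageRank.

    edges: dict mapping (word_a, word_b) -> positive weight (hypothetical input format).
    Returns a dict mapping each word to its score; scores sum to 1.
    """
    # Build a symmetric adjacency map: node -> {neighbor: weight}.
    adj = defaultdict(dict)
    for (a, b), w in edges.items():
        if a == b or w <= 0:
            continue  # skip self-loops and non-positive weights
        adj[a][b] = adj[a].get(b, 0.0) + w
        adj[b][a] = adj[b].get(a, 0.0) + w

    nodes = sorted(adj)
    n = len(nodes)
    if n == 0:
        return {}

    # Total edge weight leaving each node, used to normalize its contribution.
    out_weight = {v: sum(adj[v].values()) for v in nodes}
    score = {v: 1.0 / n for v in nodes}

    for _ in range(max_iter):
        new_score = {}
        for v in nodes:
            # Each neighbor u passes on a share of its score proportional
            # to the weight of the edge (u, v).
            incoming = sum(score[u] * w / out_weight[u] for u, w in adj[v].items())
            new_score[v] = (1.0 - damping) / n + damping * incoming
        delta = sum(abs(new_score[v] - score[v]) for v in nodes)
        score = new_score
        if delta < tol:
            break
    return score


# Toy word graph; the weights stand in for semantic weights and are made up.
toy_edges = {
    ("keyword", "extraction"): 0.9,
    ("keyword", "graph"): 0.6,
    ("knowledge", "graph"): 0.8,
    ("keyword", "ranking"): 0.5,
    ("extraction", "ranking"): 0.2,
}
scores = weighted_pagerank(toy_edges)
top_words = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_words)
```

Compared with unweighted TextRank, the only change is that a word distributes its score to neighbors in proportion to edge weight rather than evenly, so words with strong semantic ties to many other words rise in the ranking.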
Authors: YAO Yi (姚奕); YANG Fan (杨帆) (College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China)
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list; 2022, No. 10, pp. 243-251 (9 pages)
Funding: Military Postgraduate Funding Project (JY2019C078).
Keywords: Keyword extraction; Knowledge graph; Sentence embedding; Clustering; Graph-based algorithms; Pre-trained model