
Extracting Keywords from Government Work Reports with Multi-feature Fusion (基于多特征融合的政府工作报告关键词提取研究). Cited by: 2
Abstract [Objective] This paper feeds BERT word vectors, Wubi features, a domain synonym table, and character-frequency features into a BiLSTM-CRF model to automatically extract keywords from a corpus of government work reports. [Methods] BERT vectors and Wubi vectors capture the semantic and glyph features of the input sequence. A domain synonym table built for government work reports supplies category features. Character-frequency features are then assigned as weights to the word vectors to capture contextual features, so that the BiLSTM-CRF model receives richer semantic information for keyword extraction. [Results] On a self-built corpus of government work reports, the multi-feature fusion method reached a precision of 86.14%, a recall of 91.56%, and an F1 value of 88.42%. Ablation experiments evaluated the contribution of each feature. [Limitations] The model performs well on government work reports, and future work needs to improve its generalization ability. [Conclusions] The multi-feature fusion method extracts keywords more effectively than the baseline keyword extraction methods it was compared with.
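The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the features are combined, so everything here is an assumption: the function names `char_frequency_weights` and `fuse_features` are hypothetical, the frequency weight is normalized by the most frequent character, the weighted semantic vector is simply concatenated with the glyph and category vectors, and small hand-written lists stand in for real BERT vectors, Wubi vectors, and synonym-table categories.

```python
from collections import Counter


def char_frequency_weights(corpus):
    """Map each character to its count divided by the largest count.

    The normalization is an assumption; the abstract only says that
    character frequency is used as a weight.
    """
    counts = Counter(ch for sentence in corpus for ch in sentence)
    max_count = max(counts.values())
    return {ch: n / max_count for ch, n in counts.items()}


def fuse_features(sentence, semantic, glyph, category, freq_weights):
    """Build one fused vector per character of the sentence.

    semantic, glyph, category: dicts mapping a character to a list of
    floats (toy stand-ins for BERT, Wubi, and synonym-table features).
    Characters missing from freq_weights get a weight of 0.0.
    """
    fused = []
    for ch in sentence:
        w = freq_weights.get(ch, 0.0)
        # Scale the semantic vector by the frequency weight, then
        # concatenate the glyph and category features (assumed scheme).
        weighted_semantic = [w * x for x in semantic[ch]]
        fused.append(weighted_semantic + glyph[ch] + category[ch])
    return fused


# Toy data: every character shares the same placeholder vectors.
corpus = ["政府工作", "政府报告"]
chars = "政府工作报告"
semantic = {ch: [1.0, 2.0] for ch in chars}
glyph = {ch: [0.5] for ch in chars}
category = {ch: [0.0, 1.0] for ch in chars}

weights = char_frequency_weights(corpus)
vectors = fuse_features("政工", semantic, glyph, category, weights)
```

In a full pipeline the fused vectors would be the per-character inputs to a BiLSTM-CRF tagger; that stage is omitted here because the abstract gives no architectural details.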
Authors Pan Huiping (潘慧萍); Li Baoan (李宝安); Zhang Le (张乐); Lv Xueqiang (吕学强) (Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China)
Source Data Analysis and Knowledge Discovery (《数据分析与知识发现》), indexed in CSSCI, CSCD, and the Peking University Core Journals list; 2022, No. 5, pp. 54-63 (10 pages)
Funding This work is one of the research outcomes of the National Natural Science Foundation of China (Grant No. 62171043) and a Key Project of the State Language Commission (Grant No. ZDI145-10).
Keywords Keyword Extraction; Government Work Report; BERT; Wubi; Word Frequency