Abstract
[Objective] This paper fuses BERT vectors, Wubi features, domain synonym table information, and character frequency features into a BiLSTM-CRF model to automatically extract keywords from a corpus of government work reports. [Methods] BERT vectors and Wubi vectors capture the semantic and glyph features of the input sequence. A domain synonym table built for government work reports supplies category features, and character frequency is applied as a weight on the word vectors to capture contextual features, so that the BiLSTM-CRF model has richer semantic information available when extracting keywords. [Results] On a self-built corpus of government work reports, the multi-feature fusion method reached a precision of 86.14%, a recall of 91.56%, and an F1 score of 88.42%. Ablation experiments evaluated the contribution of each feature. [Limitations] The model performs well on government work reports, and future work needs to improve its generalization to other domains. [Conclusions] The multi-feature fusion method extracts keywords more effectively than the baseline keyword extraction methods it was compared with.
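The feature fusion described in the abstract can be pictured with a minimal sketch. It is not the authors' implementation: the toy vectors, the helper names (`char_frequency_weights`, `fuse_features`), the three-category synonym table, and the use of relative frequency as the weight are all assumptions made for illustration. In the paper's setting the semantic vector would come from BERT, and the fused per-character vectors would be passed to a BiLSTM-CRF tagger.

```python
from collections import Counter


def char_frequency_weights(text):
    """Relative frequency of each character in the text (assumed weighting scheme)."""
    counts = Counter(text)
    total = len(text)
    return {ch: n / total for ch, n in counts.items()}


def one_hot(index, size):
    """One-hot vector marking a (hypothetical) synonym-table category."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def fuse_features(semantic_vec, wubi_vec, category_index, num_categories, weight):
    """Scale the semantic vector by the frequency weight, then append
    the glyph (Wubi) vector and the one-hot category feature."""
    weighted = [weight * x for x in semantic_vec]
    return weighted + list(wubi_vec) + one_hot(category_index, num_categories)


# Toy inputs: every number below is a placeholder, not data from the paper.
text = "经济发展经济"
weights = char_frequency_weights(text)

semantic_vec = [0.6, -0.3]      # stand-in for a BERT vector
wubi_vec = [1.0, 0.0, 0.5]      # stand-in for a Wubi (glyph) vector
fused = fuse_features(semantic_vec, wubi_vec, category_index=1,
                      num_categories=3, weight=weights["经"])

print(len(fused), fused)
```

Concatenation keeps each feature group in its own slice of the fused vector, which makes it straightforward to drop one group at a time, as an ablation study would.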
Authors
Pan Huiping (潘慧萍)
Li Baoan (李宝安)
Zhang Le (张乐)
Lv Xueqiang (吕学强)
Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
Source
Data Analysis and Knowledge Discovery (《数据分析与知识发现》)
CSSCI
CSCD
Peking University Core Journal (北大核心)
2022, Issue 5, pp. 54-63 (10 pages)
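Funding
This work is one of the research outcomes of the following projects:
National Natural Science Foundation of China (Grant No. 62171043)
Key Project of the State Language Commission (Grant No. ZDI145-10)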
Keywords
Keyword Extraction
Government Work Report
BERT
Wubi
Word Frequency