
BERT-based multi-feature fusion for agricultural named entity recognition (cited by: 26)

Recognition of the agricultural named entities with multi-feature fusion based on BERT
Abstract Named entity recognition is a key step in extracting information from agricultural text. To address missing local context features, single-faceted character vector representations, and low recognition rates for rare entities, this paper proposes a named entity recognition method that fuses character-level features from BERT (Bidirectional Encoder Representations from Transformers) with features from an external dictionary. The BERT pre-trained model integrates left and right context to enrich the semantic representation of each character and alleviate polysemy. A self-built agricultural dictionary, combined with a bi-directional maximum matching strategy, yields distributed dictionary feature representations that improve recognition accuracy for rare or unknown entities. A bi-directional long short-term memory (BiLSTM) network produces the sequence feature matrix, and a conditional random field (CRF) model generates the globally optimal label sequence. An agricultural corpus was built with domain expert knowledge, containing 5295 annotated entries and 5 categories of agricultural entities. On this corpus the model achieved a precision of 94.84%, a recall of 95.23%, and an F1 score of 95.03%. The results show that the method recognizes agricultural named entities effectively and with higher precision than the other models compared.

Agricultural named entity recognition is a fundamental task for information extraction in the agricultural domain. To address missing local context features, the inability to resolve polysemy, and the low recognition rate of rare entities, a model combining character-level features and dictionary features was proposed to identify entities in text automatically, with the character-level features obtained from the BERT (Bidirectional Encoder Representations from Transformers) model. First, the BERT pre-trained language model was used to integrate left and right contextual information, yielding character-level features with an enhanced semantic representation and alleviating polysemy. Second, an agricultural dictionary was built, and external dictionary information was introduced through a feature extraction strategy to improve recognition accuracy for rare or unknown entities. Two feature extraction strategies were designed to capture the dictionary features: an N-gram feature template algorithm and a bi-directional maximum matching algorithm. The character-level features and dictionary features were then fused as the input to the next neural network layer. Finally, the fused features were encoded by the BiLSTM (Bi-directional Long Short-Term Memory) neural network layer to obtain the sequence feature matrix, and the optimal label sequence was obtained by a CRF (Conditional Random Field). Based on domain expert knowledge, a labeling strategy for agricultural named entities was proposed to resolve fuzzy entity boundaries and preserve the integrity of the entities. Experiments were carried out on an agricultural corpus containing 5295 labeled entries and 5 categories of agricultural entities. The model performed well overall on this corpus, with a recognition precision, recall, and F1-score of 94.84%, 95.23%, and 95.03%, respectively. By category, because crop diseases and pesticides have distinct boundary characteristics, the model achieved higher recognition precision on them than on the remaining three entity types: machinery, pests, and crop varieties. In the comparison of dictionary feature extraction strategies, the model based on the bi-directional maximum matching algorithm outperformed the one based on the N-gram feature template algorithm. With 10 templates, the N-gram-based model performed best, with a recognition precision of 93.95% and an F1-score of 94.03%. Bi-directional maximum matching with feature embedding captured more latent information than one-hot encoding, improving the precision and F1-score of the model by 0.49 and 0.91 percentage points, respectively. Compared with the BiLSTM-CRF and BERT-BiLSTM-CRF models, the BERT-Dic-BiLSTM-CRF model proposed in this paper had a clear advantage, with the highest recognition precision of 94.84%. Compared with the BERT-BiLSTM-CRF model, its recognition precision on rare and unknown entities improved by 5.93 and 6.44 percentage points, respectively, further confirming that integrating dictionary features into the model improves recognition accuracy for such entities.
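The bi-directional maximum matching strategy named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tie-breaking heuristic (fewer words, then fewer single-character words, then the backward result), the B/I/E/S/O tag scheme, and the toy lexicon are all assumptions made for illustration. In a full model, each tag would presumably be mapped to a learned embedding and fused with the BERT character vector; that step is not shown.

```python
from typing import List, Set


def forward_max_match(text: str, lexicon: Set[str], max_len: int) -> List[str]:
    """Scan left to right, always taking the longest lexicon word available."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text: str, lexicon: Set[str], max_len: int) -> List[str]:
    """Scan right to left, always taking the longest lexicon word available."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def bidirectional_max_match(text: str, lexicon: Set[str]) -> List[str]:
    """Pick between the two scans using an assumed, commonly used heuristic."""
    max_len = max((len(w) for w in lexicon), default=1)
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd

    def singles(ws: List[str]) -> int:
        return sum(len(w) == 1 for w in ws)

    # Ties go to the backward result.
    return fwd if singles(fwd) < singles(bwd) else bwd


def dictionary_tags(text: str, lexicon: Set[str]) -> List[str]:
    """One tag per character: B/I/E/S inside a lexicon word, O elsewhere."""
    tags: List[str] = []
    for word in bidirectional_max_match(text, lexicon):
        if word not in lexicon:
            tags.append("O")  # unmatched pieces are always single characters
        elif len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


# Hypothetical toy lexicon: a crop, a disease, and a pesticide.
toy_lexicon = {"小麦", "条锈病", "小麦条锈病", "三唑酮"}
sentence = "小麦条锈病可用三唑酮防治"
print(bidirectional_max_match(sentence, toy_lexicon))
print(dictionary_tags(sentence, toy_lexicon))
```

Because the tags are aligned character by character with the input, they can sit alongside a character-level encoder's output without any re-tokenization, which is one plausible reason to prefer this form of dictionary feature.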
Authors Zhao Pengfei (赵鹏飞); Zhao Chunjiang (赵春江); Wu Huarui (吴华瑞); Wang Wei (王维) (School of Engineering, Shanxi Agricultural University, Taigu 030801, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China; Beijing Research Center for Information Technology in Agriculture, Beijing 100097, China; Beijing Research Center of Intelligent Equipment for Agriculture, Beijing 100097, China)
Source Transactions of the Chinese Society of Agricultural Engineering (《农业工程学报》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2022, Issue 3, pp. 112-118 (7 pages)
Funding National Key Research and Development Program of China (2019YFD1101105); National Natural Science Foundation of China (61871041); Beijing Science and Technology Program (Z191100004019007).
Keywords agriculture; named entity recognition; text; BERT; dictionary feature; BiLSTM