Agricultural named entity recognition based on entity-level masking BERT and BiLSTM-CRF
Cited by: 8
Abstract: The positional and semantic information of characters is essential for recognizing Chinese agricultural entities, whose naming conventions are varied and whose names are often long. To address the poor recognition that results from insufficient capture of character position information, contextual semantic features, and long-distance dependencies, this study proposes a Chinese agricultural named entity recognition method based on an EmBERT-BiLSTM-CRF model. The method uses the Transformer-based deep bidirectional pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) as the embedding layer to extract deep bidirectional representations of character vectors, and applies an entity-level masking strategy so that the model represents Chinese semantics better. A bidirectional long short-term memory network (BiLSTM) then learns the long-sequence semantic features of the text. Finally, a conditional random field (CRF) learns labelling constraints from the training data and uses the information between adjacent labels to output the globally optimal label sequence. A focal loss function is used during training to mitigate the imbalanced sample distribution. Experiments on a purpose-built corpus recognized four types of agricultural entities: crop varieties, diseases, insect pests, and pesticides. The results show that the EmBERT-BiLSTM-CRF model clearly outperforms the other models on the four entity types, with an accuracy of 94.97% and an F1 score of 95.93%.

Extended abstract: Intelligent question answering over agricultural knowledge is an important part of information-based agriculture. Named entity recognition is a key technology for intelligent question answering and knowledge graph construction in the agricultural domain, and these applications demand accurate identification of named entities. Because agricultural entity names are long and their naming is complex, Chinese named entity recognition depends heavily on the positional and semantic information of characters. Improving recognition performance therefore requires capturing character position, contextual semantic features, and long-distance dependency information sufficiently.

This study proposes a Chinese agricultural named entity recognition method based on the EmBERT-BiLSTM-CRF model. First, the BERT pre-trained language model was applied as the embedding layer. Pre-training deep bidirectional representations of character vectors improved the model's contextual semantic representation and alleviated polysemy. Second, BERT's language masking was adapted to the characteristics of Chinese: an entity-level masking strategy masked each Chinese entity in a sentence completely, as a run of consecutive tokens. This represented Chinese semantics better and alleviated the bias caused by incomplete semantics. Third, a BiLSTM model learned long-sequence semantic features using two LSTM networks (forward and backward), so that contextual information in both directions was considered at the same time and the long-distance dependency information of the text was captured. Finally, a CRF learned the labelling constraints in the training data. The learned constraint rules were used to check whether a label sequence was legal during prediction, and the CRF used the information of adjacent labels to output the globally optimal label sequence. Thus, the model outputs not a sequence of independently predicted labels but an optimal sequence that respects the learned rules and label order. A focal loss function was also used to alleviate the unbalanced sample distribution.

A named entity recognition corpus was constructed for the experiments. After BIO labelling, the corpus contained a total of 29790 agricultural entities: 11057 crop, 8121 pesticide, 4505 disease, and 6107 pest entities. The training, validation, and test sets were divided in a 7:2:1 ratio. Four types of agricultural entities were identified and labelled in the text: crop varieties, pesticides, diseases, and insect pests. The experimental results show that the recognition accuracy of the EmBERT-BiLSTM-CRF model for the four types of entities was 94.97%, and the F1 score was 95.93%. Compared with the models based on BiLSTM-CRF and BERT-BiLSTM-CRF, the recognition performance of EmBERT-BiLSTM-CRF was significantly improved. This indicates that using a pre-trained language model as the embedding layer represents the characteristics of characters well, and that the entity-level masking strategy alleviates the bias caused by incomplete semantics, which enhances the model's ability to represent Chinese semantics and enables it to identify Chinese agricultural named entities more accurately.

This research can not only provide relatively high entity recognition accuracy for tasks such as agricultural intelligent question answering, but also offer new ideas for the identification of Chinese named entities in fishery, animal husbandry, Chinese medicine, and biology.
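The entity-level masking idea in the abstract (masking a whole entity as a run of consecutive tokens rather than masking isolated characters) can be illustrated with a minimal sketch. This is not the authors' implementation: the tag names (`CROP`, `DISEASE`, `PESTICIDE`), the per-entity masking probability `mask_prob`, and the helper names `bio_to_spans` and `entity_level_mask` are all assumptions introduced here for illustration.

```python
import random

MASK = "[MASK]"  # BERT's mask token

def bio_to_spans(tags):
    """Return (start, end) pairs, end exclusive, one per entity in a BIO tag list."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that was still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag.startswith("I-") and start is not None:
            # Still inside the current entity.
            continue
        else:
            # "O" (or a stray "I-") ends the current entity, if any.
            if start is not None:
                spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

def entity_level_mask(chars, tags, mask_prob=0.15, rng=None):
    """Mask every character of a selected entity together.

    Each entity is selected independently with probability mask_prob
    (an assumed scheme; the abstract does not state the selection rule).
    """
    rng = rng or random.Random()
    masked = list(chars)
    for start, end in bio_to_spans(tags):
        if rng.random() < mask_prob:
            for i in range(start, end):
                masked[i] = MASK
    return masked

# Hypothetical example sentence: "rice / rice blast / can use / tricyclazole / control".
chars = list("水稻稻瘟病可用三环唑防治")
tags = ["B-CROP", "I-CROP",
        "B-DISEASE", "I-DISEASE", "I-DISEASE",
        "O", "O",
        "B-PESTICIDE", "I-PESTICIDE", "I-PESTICIDE",
        "O", "O"]

# With mask_prob=1.0 every entity is masked in full; non-entity characters stay.
print(entity_level_mask(chars, tags, mask_prob=1.0))
```

With character-level masking, a model might see "稻[MASK]病" and recover the missing character from its neighbours alone; masking the whole entity forces it to rely on the surrounding sentence, which is the intuition behind the "incomplete semantics" remark in the abstract.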
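The abstract states that a focal loss was used to counter the unbalanced sample distribution, but gives neither its form nor its hyperparameters. The sketch below uses the standard focal loss for a single prediction, FL(p_t) = -α (1 - p_t)^γ log(p_t), where p_t is the probability assigned to the true label. The values γ = 2 and α = 1 are assumptions, and how the loss is combined with CRF training in the paper is not stated.

```python
import math

def cross_entropy(probs, target):
    """Standard cross-entropy for one prediction: -log(p_t)."""
    return -math.log(probs[target])

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one prediction: -alpha * (1 - p_t)**gamma * log(p_t).

    gamma and alpha are assumed values, not taken from the paper.
    With gamma = 0 and alpha = 1 this reduces to cross-entropy.
    """
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy example (true label predicted with p = 0.9) is down-weighted by
# roughly (1 - 0.9)**2, about 0.01, relative to cross-entropy.
easy = [0.1, 0.9]
# A hard example (true label predicted with p = 0.1) keeps most of its loss,
# since (1 - 0.1)**2 is about 0.81.
hard = [0.9, 0.1]

for name, probs in (("easy", easy), ("hard", hard)):
    print(name, cross_entropy(probs, 1), focal_loss(probs, 1))
```

Because frequent labels such as `O` tend to be easy, well-classified examples, the (1 - p_t)^γ factor shrinks their contribution and leaves relatively more weight on rarer, harder entity labels.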
Authors: Wei Zijun (韦紫君); Song Ling (宋玲); Hu Xiaochun (胡小春); Chen Ningjiang (陈宁江) (School of Computer and Electronics Information, Guangxi University, Nanning 530004, China; College of Information Engineering, Nanning University, Nanning 530200, China; Guangxi Key Laboratory of Multimedia Communications and Networks Technology, Nanning 530004, China; School of Information and Statistics, Guangxi University of Finance and Econ)
Source: Transactions of the Chinese Society of Agricultural Engineering (《农业工程学报》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2022, Issue 15, pp. 195-203 (9 pages)
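Funding: National Key Research and Development Program of China (2018YFB1404404); Guangxi Key Research and Development Program (桂科AB19110050); Nanning Major Science and Technology Project (20211005).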
Keywords: agriculture; named entity recognition; entity-level masking; BERT; BiLSTM; CRF