Biomedical Named Entity Recognition Method Based on Word Meaning Enhancement
Abstract: Biomedical Named Entity Recognition (BioNER) is a core task of biomedical text mining and provides strong support for downstream tasks. Biomedical data contain more out-of-vocabulary (OOV) words than general-domain data. Existing BioNER methods usually split OOV words into morphemes for representation learning. This alleviates the lack of representation information for OOV words, but it destroys the internal information of words, so label prediction over morphemes is prone to label inconsistency and cross-entity labels. In addition, splitting words into morphemes lengthens sentences, which aggravates the vanishing gradient problem during training. To address these problems, a BioNER method that performs word meaning enhancement with a Bidirectional Long Short-Term Memory (BiLSTM)-Biaffine structure is proposed. First, morpheme representations are obtained from the BioBERT pre-trained model. Next, BiLSTM-Biaffine is used for word meaning enhancement: at the word level, BiLSTM captures the forward and backward sequence information of the morphemes, and a Biaffine attention mechanism strengthens their association information and fuses them back into a word representation. Finally, a BiLSTM-CRF model produces the label sequence of the input sentence. Experimental results show that on the BC2GM, NCBI-Disease, BC5CDR-chem, and JNLPBA datasets, the method achieves F1 scores of 84.94%, 89.07%, 92.14%, and 74.57%, respectively. Compared with mainstream sequence labeling models such as MTM-CW and MT-BioNER, this is an average improvement of 2.99, 1.84, 3.09, and 1.03 percentage points, respectively, which verifies the effectiveness of the proposed method for BioNER.
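The label inconsistency that arises when labels are predicted per morpheme can be illustrated with a minimal sketch. It assumes WordPiece-style tokenization with a `##` continuation prefix (as used by BERT-family tokenizers) and BIO labels. The tokenization, the predicted labels, and the function names `group_wordpieces` and `inconsistent_words` are hypothetical illustrations, not taken from the paper, and the code only detects the problem; it is not the BiLSTM-Biaffine method itself.

```python
from typing import List, Tuple


def group_wordpieces(pieces: List[str]) -> List[List[int]]:
    """Group token indices into words.

    Assumption: a piece starting with '##' continues the previous word.
    """
    groups: List[List[int]] = []
    for i, piece in enumerate(pieces):
        if piece.startswith("##") and groups:
            groups[-1].append(i)   # continuation of the current word
        else:
            groups.append([i])     # start of a new word
    return groups


def inconsistent_words(
    pieces: List[str], labels: List[str]
) -> List[Tuple[str, List[str]]]:
    """Return words whose pieces were assigned more than one entity type."""
    problems: List[Tuple[str, List[str]]] = []
    for idxs in group_wordpieces(pieces):
        # Strip the B-/I- prefix so only the entity type (or 'O') is compared.
        types = {labels[i].split("-", 1)[-1] for i in idxs}
        if len(types) > 1:
            word = pieces[idxs[0]] + "".join(pieces[i][2:] for i in idxs[1:])
            problems.append((word, [labels[i] for i in idxs]))
    return problems


# Hypothetical tokenization and per-piece predictions for illustration only.
pieces = ["the", "neuro", "##fibro", "##mato", "##sis", "gene"]
labels = ["O", "B-Disease", "I-Disease", "O", "I-Disease", "O"]
print(inconsistent_words(pieces, labels))
```

Fusing the pieces of each group back into a single word representation before labeling, which is the role the abstract assigns to BiLSTM-Biaffine, means the tagger emits one label per word, so this kind of conflict cannot occur within a word.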
Authors: CHEN Mengxuan; CHEN Yanping; HU Ying; HUANG Ruizhang; QIN Yongbin (State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China; College of Computer Science and Technology, Guizhou University, Guiyang 550025, China)
Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list, 2023, Issue 10, pp. 305-312 (8 pages)
Funding: National Natural Science Foundation of China (62166007).
Keywords: Biomedical Named Entity Recognition (BioNER); morpheme; word meaning enhancement; Bidirectional Long Short-Term Memory (BiLSTM) network; attention mechanism