Abstract
Character-based Chinese medical named entity recognition models suffer from a single embedding form, difficult boundary recognition, and insufficient use of semantic information. An effective remedy is to inject lexical features into the bottom layers of BERT, which exploits word-granularity semantic information while reducing the impact of word segmentation errors. However, injecting lexical information also introduces low-relevance words and noise, which scatters the attention of the attention-based BERT model. In addition, character and word granularity alone cannot fully capture the deep semantic information of Chinese characters. This study therefore proposes a Chinese medical entity recognition model based on attention enhancement and feature fusion. The character-word attention score matrix is sparsified so that the model's attention concentrates on highly relevant words, which effectively reduces interference from noisy words in the context. Meanwhile, Convolutional Neural Networks (CNNs) extract features from the pronunciation and strokes of Chinese characters; these features are fused by an iterative attention feature fusion module, concatenated with the output features of the BERT model, and fed into a BiLSTM model to further mine the deep semantic information contained in the characters. A large medical corpus is collected using web crawlers and other methods to train a medical-domain word vector library, and the model is validated on the CCKS2017 and CCKS2019 datasets. The experimental results show that the model achieves F1 values of 94.90% and 89.37%, respectively, outperforming current mainstream entity recognition models.
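The abstract does not say how the attention score matrix is sparsified. As a minimal sketch only, assuming a simple top-k rule (one common way to sparsify attention, not necessarily the authors' method), each character keeps its k highest-scoring candidate words and the rest receive zero weight after the softmax. The function name `sparse_attention_weights`, the parameter `k`, and the example scores are all hypothetical.

```python
import math


def sparse_attention_weights(scores, k):
    """Top-k sparsification followed by a row-wise softmax.

    scores: list of non-empty rows, one per character, each holding that
            character's attention scores over its candidate words.
    k:      number of words to keep per character (assumed hyperparameter).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    weights = []
    for row in scores:
        k_eff = min(k, len(row))
        # The k-th largest score in the row acts as the cut-off.
        threshold = sorted(row, reverse=True)[k_eff - 1]
        # Scores below the cut-off are masked; ties at the cut-off are all kept.
        masked = [s if s >= threshold else float("-inf") for s in row]
        # Numerically stable softmax; masked entries become exactly 0.0.
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical scores: 2 characters x 4 candidate words.
scores = [[2.0, 0.1, 1.5, -1.0],
          [0.3, 0.2, 3.0, 0.1]]
weights = sparse_attention_weights(scores, k=2)
```

With `k=2`, only the two highest-scoring words per character receive non-zero weight, so low-relevance words no longer dilute the attention distribution. A threshold-based or learned sparsification would fit the description in the abstract equally well.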
Authors
王晋涛
秦昂
张元
陈一飞
王廷凤
谢承霖
邹刚
WANG Jintao; QIN Ang; ZHANG Yuan; CHEN Yifei; WANG Tingfeng; XIE Chenglin; ZOU Gang (School of Computer Science and Technology, North University of China, Taiyuan 030051, Shanxi, China; Hunan Provincial Tumor Hospital, Changsha 410031, Hunan, China; The Affiliated Hospital of Hunan Academy of Traditional Chinese Medicine, Changsha 410006, Hunan, China; Hunan ZK Help Innovation Intelligent Technology Research Institute, Changsha 410076, Hunan, China)
Source
《计算机工程》
CAS
CSCD
PKU Core (Peking University Core Journals)
2024, No. 7, pp. 324-332 (9 pages)
Computer Engineering
Funding
Natural Science Foundation of Hunan Province (2022JJ70022).
Keywords
entity recognition
Chinese word segmentation
sparse attention
feature fusion
medical word vector library