Abstract
As the first key step in medical information extraction, medical named entity recognition (NER) aims to extract medical entities from unstructured text such as electronic medical records and Chinese drug package inserts. Most current work on Chinese medical NER obtains text representation vectors by fine-tuning pre-trained models and then applies feature engineering to improve performance in the medical domain. Most of these models are derived from models that perform well on general-purpose datasets and do not consider the linguistic characteristics of Chinese medical datasets. A statistical analysis of multiple medical datasets shows that some types of medical entities share glyph-level features: for example, most Chinese characters denoting diseases contain "疒", and most characters denoting body organs contain "月". To address these problems, this paper proposes a glyph-feature-based method for Chinese medical NER. The method improves the model's accuracy and generalization ability by fusing glyph vectors into the text representation vectors and by further exploiting the negative samples in the dataset. Experimental results on multiple public Chinese medical datasets show that the method outperforms other models, and ablation experiments confirm that fusing glyph features and learning from negative samples are both effective for this task.
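The glyph observation in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the small hand-made radical table `RADICAL_OF`, the one-hot encoding in `glyph_vector`, and the use of plain concatenation in `fuse` (the abstract does not say how the glyph vector is fused with the text representation).

```python
# Hypothetical, hand-made radical table for illustration only; a real system
# would derive components from a character decomposition resource.
RADICAL_OF = {
    "病": "疒", "痛": "疒", "症": "疒", "癌": "疒",
    "肝": "月", "肺": "月", "胃": "月", "脑": "月",
}
RADICALS = ["疒", "月"]  # glyph vocabulary; the last slot means "other"


def glyph_vector(ch):
    """Return a one-hot glyph feature over RADICALS plus a final 'other' slot."""
    vec = [0.0] * (len(RADICALS) + 1)
    radical = RADICAL_OF.get(ch)
    idx = RADICALS.index(radical) if radical in RADICALS else len(RADICALS)
    vec[idx] = 1.0
    return vec


def fuse(text_vec, ch):
    """Fuse by concatenation (an assumed operator, not the paper's)."""
    return list(text_vec) + glyph_vector(ch)


if __name__ == "__main__":
    # Stand-in for a per-character vector from a pre-trained encoder.
    placeholder_text_vec = [0.1, 0.2, 0.3]
    for ch in "肝癌的":
        print(ch, fuse(placeholder_text_vec, ch))
```

In this toy setup, characters sharing a component receive the same glyph feature regardless of their contextual vector, which is the kind of signal the abstract suggests can help separate disease and body-part entities.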
Authors
MENG Wei-lun (孟伟伦)
GUO Jing-feng (郭景峰)
XING Ke-xuan (邢珂萱)
WEI Ning (魏宁)
WANG Qiao-suo (王巧梭)
LIU Bin (刘滨)
Affiliations: School of Information Science and Technology, Yanshan University, Qinhuangdao, Hebei 066004, China; The Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province, Qinhuangdao, Hebei 066004, China; Hebei Construction Material Vocational and Technical College, Qinhuangdao, Hebei 066000, China; Big Data and Social Computing Research Center, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018, China
Source
Acta Electronica Sinica (《电子学报》)
EI
CAS
CSCD
PKU Core (北大核心)
2024, No. 6, pp. 1945-1954 (10 pages)
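Funding
Hebei Provincial Science and Technology Program (No. 21310101D)
Central Government Guided Local Science and Technology Development Fund (No. 226Z0102G)
National Culture and Tourism Science and Technology Innovation Project (2020)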
Keywords
glyph feature
negative sample
two-stage
medical information
named entity recognition
deep learning