Abstract
Named entity recognition (NER) in Chinese electronic medical records (EMRs) is a difficult problem in medical information extraction. This paper proposes an entity recognition method based on multi-task learning that jointly trains entity recognition and word segmentation. Private layers based on Bi-LSTM extract task-specific features, while an attention network serves as the shared layer, supplemented by a general feature enhancement mechanism that filters global information; this reduces the risk of overfitting and improves the model's generalization ability. In addition, a balanced oversampling method is proposed to augment the dataset, which effectively addresses the problems caused by imbalance among entity categories. The CCKS2017/CCKS2020 EMR entity recognition corpora and the Medicine word segmentation corpus are used for joint training. Experimental results show that the proposed model clearly improves overall performance on EMR entity recognition and also significantly improves word segmentation on the Medicine corpus, raising F1 by 3 percentage points over the baseline. The analysis shows that the proposed model effectively reduces entity segmentation errors caused by irregular writing, unstructured text, and specialized terminology in EMRs.
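The abstract does not spell out how the balanced oversampling is performed. As a rough illustration only, one simple reading is to duplicate sentences that contain under-represented entity types until every type has at least as many mentions as the most frequent one. The sketch below assumes BIO-tagged sentences; the function names `entity_counts` and `balanced_oversample`, the tag names `SYM` and `DRUG`, and the toy corpus are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def entity_counts(sentences):
    """Count entity mentions per type, treating each B- tag as one mention."""
    counts = Counter()
    for sent in sentences:
        for _, tag in sent:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts


def balanced_oversample(sentences, seed=0):
    """Duplicate sentences with rare entity types until every type reaches
    the mention count of the most frequent type (an assumed procedure,
    not necessarily the one used in the paper)."""
    rng = random.Random(seed)
    counts = entity_counts(sentences)
    if not counts:
        return list(sentences)
    target = max(counts.values())
    augmented = list(sentences)
    # Handle the rarest types first.
    for etype in sorted(counts, key=counts.get):
        # Candidate sentences: originals that mention this entity type.
        pool = [s for s in sentences
                if any(tag == "B-" + etype for _, tag in s)]
        while counts[etype] < target:
            sent = rng.choice(pool)
            augmented.append(sent)
            # A duplicated sentence adds to every type it contains.
            counts.update(entity_counts([sent]))
    return augmented


# Toy character-level corpus (made up for illustration).
corpus = [
    [("头", "B-SYM"), ("痛", "I-SYM")],
    [("发", "B-SYM"), ("热", "I-SYM"), ("咳", "B-SYM"), ("嗽", "I-SYM")],
    [("服", "O"), ("阿", "B-DRUG"), ("司", "I-DRUG"),
     ("匹", "I-DRUG"), ("林", "I-DRUG")],
]
augmented = balanced_oversample(corpus)
print(dict(entity_counts(corpus)), "->", dict(entity_counts(augmented)))
```

Because whole sentences are duplicated, frequent types that co-occur with rare ones can end up above the target; a more careful scheme would account for that co-occurrence.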
Authors
于鹏
陈钰枫
徐金安
张玉洁
YU Peng; CHEN Yu-feng; XU Jin-an; ZHANG Yu-jie (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China)
Source
《计算机与现代化》
2022, No. 9, pp. 40-50 (11 pages)
Computer and Modernization
Funding
National Natural Science Foundation of China, General Program (61976016, 61976015, 61876198).
Keywords
deep learning
named entity recognition
multi-task learning
neural network
attention mechanism