Abstract
To reduce the dependence of existing deep named entity recognition (NER) models on high-quality annotated datasets, this paper proposes a medical NER method for medical text parsing, based on semi-supervised learning and RoBERTa multi-layer representation fusion. Building on a RoBERTa-wwm-ext-BiLSTM-CRF multi-layer representation fusion model, the method uses pseudo-labeling to expand the training set and adds a noise reduction module to mitigate the influence of noise in the pseudo-labeled data. In experiments on the CCKS 2021 medical named entity recognition dataset and the CBLUE CMeEE dataset, the proposed method improves the F1 score by 1.14% and 1.63%, respectively, over the classic BERT-BiLSTM-CRF method. These results indicate that combining a semi-supervised learning strategy with RoBERTa multi-layer representation fusion can effectively improve Chinese medical named entity recognition.
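The abstract does not say how pseudo-labels are chosen or how the noise reduction module works. As a rough illustration of the general pseudo-labeling idea only, the sketch below keeps an unlabeled sentence when the tagger's least-confident tag still meets a threshold. The function names, the `threshold` value, the tag set, and the toy tagger are all hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical types: a sentence is a list of characters; a prediction is
# a list of (tag, confidence) pairs, one per character.
Sentence = List[str]
Prediction = List[Tuple[str, float]]


def select_pseudo_labels(
    unlabeled: List[Sentence],
    predict: Callable[[Sentence], Prediction],
    threshold: float = 0.9,  # assumed cut-off, not a value from the paper
) -> List[Tuple[Sentence, List[str]]]:
    """Keep sentences whose least-confident tag still meets the threshold."""
    selected = []
    for sentence in unlabeled:
        prediction = predict(sentence)
        if not prediction:
            continue  # nothing to learn from an empty prediction
        min_conf = min(conf for _, conf in prediction)
        if min_conf >= threshold:
            selected.append((sentence, [tag for tag, _ in prediction]))
    return selected


def toy_predict(sentence: Sentence) -> Prediction:
    """Stand-in for a trained tagger (assumption): confident only on one term."""
    if "".join(sentence) == "糖尿病":
        return [("B-dis", 0.97), ("I-dis", 0.95), ("I-dis", 0.96)]
    return [("O", 0.55) for _ in sentence]


if __name__ == "__main__":
    pool = [list("糖尿病"), list("今天")]
    for chars, tags in select_pseudo_labels(pool, toy_predict):
        print("".join(chars), tags)
```

In a real pipeline, the selected pairs would be merged with the labeled training set before retraining. Filtering on the minimum token confidence is one simple way to limit noisy pseudo-labels, and it should not be read as the paper's noise reduction module.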
Authors
ZHANG Shuai (张帅)
GAO Xiao-yuan (高晓苑)
YANG Tao (杨涛)
LIU Jie (刘杰)
Affiliations: School of Artificial Intelligence and Information Technology, Nanjing University of Chinese Medicine, Nanjing 210023, China; University of Chinese Academy of Sciences, Nanjing, Nanjing 211135, China
Source
Software Guide (《软件导刊》)
2023, No. 5, pp. 23-28 (6 pages)
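Funding
National Natural Science Foundation of China (82174276)
China Postdoctoral Science Foundation (2021M701674)
Jiangsu Postdoctoral Research Funding Program (2021K457C)
"Qing Lan Project" of Jiangsu Universities (2021)
Jiangsu Postgraduate Training Innovation Project (KYCX21_1626)
Jiangsu College Students' Practice and Innovation Training Program (202110315026)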
Keywords
named entity recognition
semi-supervised learning
pre-trained language models
deep learning
natural language processing