Abstract
Named entity recognition for electronic medical records (EMR) is an important means of medical information extraction. To improve the accuracy of the semantic information extracted from EMR text, a named entity recognition algorithm based on RoBERTa (robustly optimized BERT pretraining approach) and character-word fusion is proposed. The algorithm first uses the pre-trained RoBERTa model to obtain character vectors that take full account of contextual information. The text is then segmented into words, and Word2Vec is used to obtain word vectors. Finally, the two are fused and fed into a BiLSTM (bidirectional long short-term memory network) for training, and a CRF (conditional random field) predicts the output labels. Comparative experiments on EMR datasets show that the proposed algorithm clearly outperforms classical EMR named entity recognition methods on all three evaluation metrics.
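The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two representations are combined, so this example assumes simple concatenation: each character's vector is joined with the vector of the segmented word that contains it. The function name `fuse_char_word`, the toy vectors, and the zero-vector fallback for out-of-vocabulary words are all hypothetical. In a real pipeline the character vectors would come from RoBERTa and the word vectors from a trained Word2Vec model.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def fuse_char_word(
    words: Sequence[str],
    char_vecs: Sequence[Vector],
    word_table: Dict[str, Vector],
    word_dim: int,
) -> List[Vector]:
    """Concatenate each character vector with the vector of its containing word.

    Assumption: fusion is plain concatenation; the paper may use another scheme.
    """
    n_chars = sum(len(w) for w in words)
    if n_chars != len(char_vecs):
        raise ValueError("segmentation and character vectors are misaligned")

    fused: List[Vector] = []
    i = 0
    for word in words:
        # Assumption: out-of-vocabulary words fall back to a zero vector.
        w_vec = word_table.get(word, [0.0] * word_dim)
        for _ in word:
            # Every character in the word shares the same word vector.
            fused.append(list(char_vecs[i]) + list(w_vec))
            i += 1
    return fused


# Toy input: a sentence segmented into two words (four characters in total).
words = ["患者", "头痛"]
# Stand-ins for contextual character vectors (2 dimensions each).
char_vecs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
# Stand-in for a word-vector lookup table (3 dimensions); "头痛" is missing.
word_table = {"患者": [1.0, 0.0, 0.0]}

fused = fuse_char_word(words, char_vecs, word_table, word_dim=3)
```

Each fused vector here has 5 dimensions (2 from the character, 3 from the word), and the resulting sequence has one vector per character, which is the shape a character-level BiLSTM-CRF tagger would consume.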
Authors
王卫东
张志峰
徐金慧
杨习贝
WANG Weidong; ZHANG Zhifeng; XU Jinhui; YANG Xibei (School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang 212100, China)
Source
《江苏科技大学学报(自然科学版)》
CAS
PKU Core Journal (北大核心)
2023, No. 2, pp. 47-52 (6 pages)
Journal of Jiangsu University of Science and Technology:Natural Science Edition
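Funding
National Natural Science Foundation of China (51609110, 51779110)
Natural Science Foundation of Jiangsu Province (BK20191461)
Six Talent Peaks Project of Jiangsu Province (KTHY-064).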