Abstract
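In recent years, the volume of electronic medical record (EMR) text has grown steadily, providing a rich source of knowledge for medical research. Using effective text mining techniques, tailored to domain needs, to extract medical knowledge from EMR text automatically, quickly and accurately would greatly advance research in healthcare. Chinese clinical named entity recognition, a fundamental task in Chinese medical information extraction, has received wide attention. Most current work on Chinese EMR entity recognition builds on traditional general-purpose text representation vectors and relies on feature engineering to improve model performance in the medical domain; pretrained representations suited to the Chinese biomedical domain are lacking. In addition, annotated Chinese EMR data is very scarce, and annotating EMR entities requires professional medical background knowledge and is time-consuming and labor-intensive. To address these problems, this paper proposes a Chinese EMR entity recognition method based on stroke ELMo and multi-task learning. First, the ELMo representation learning method is improved by taking stroke sequences as input, and large amounts of unlabeled Chinese biomedical text are used to learn stroke ELMo vectors that are context-dependent and encode the internal structure of Chinese characters. Then a neural network model based on multi-task learning is built to make full use of existing data and improve model performance. The paper also systematically compares common additional features for entity recognition (word embeddings, dictionary features and radical features) and mainstream neural network models (CNN, BiLSTM, CNN-CRF and BiLSTM-CRF) on the Chinese EMR entity recognition task. Experimental results show that on this task the BiLSTM-CRF model achieves better results than the other models, and that the dictionary feature is the most effective of the common additional features. Compared with other existing methods, the proposed neural network model based on stroke ELMo and multi-task learning achieves better results on both the CCKS17 and CCKS18 CNER datasets, with F-scores of 91.75% and 90.05%, respectively.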
In recent years, the volume of electronic medical record text has grown substantially, providing a rich source of knowledge for medical research. Effective text mining technology, adapted to the needs of the medical domain, can extract medical information from massive electronic medical records efficiently and accurately, which will greatly promote research in the healthcare field. Chinese Clinical Named Entity Recognition (CNER) is a fundamental task in Chinese medical information extraction and has received much attention. However, most existing Chinese CNER work is based on traditional text representation embeddings (i.e., a context-independent representation for each word) and depends on feature engineering to improve model performance in the medical field. There is little related work on pretrained text embeddings for Chinese biomedical text. In addition, existing Chinese CNER datasets are small, and medical entity annotation requires medical background knowledge and is time-consuming and labor-intensive. To address these problems, this paper proposes a Chinese CNER method based on stroke ELMo and multi-task learning. Firstly, a stroke ELMo (Embeddings from Language Models) model is proposed to obtain pretrained Chinese text representations. The ELMo method is improved by taking the stroke sequence as input. The resulting representation is context-dependent and can learn rich structural information about Chinese characters from a large Chinese biomedical text corpus. To learn high-quality Chinese biomedical text representations, a large number of Chinese medical abstracts were downloaded from the CNKI website. These abstracts, together with the Chinese electronic medical record texts provided by the China Conference on Knowledge Graph and Semantic Computing (CCKS) challenge, were then used to train the stroke ELMo embeddings. The experimental results show that stroke ELMo embeddings achieve better performance than traditional word2vec embeddings. When the concatenation of the word2vec and stroke ELMo embeddings is fed into the model as input, the model obtains the best performance. Secondly, we explored the effect of multi-task learning on the Chinese CNER task. A single-task model, a fully-shared multi-task learning model and a shared-private multi-task learning model are compared on the CCKS17 and CCKS18 datasets. The experimental results show that the shared-private multi-task learning model achieves the best F-score. It can exploit the correlation between the tasks to improve model performance and make full use of the existing datasets. We also tested the performance of the multi-task learning model on training sets of different sizes. The shared-private multi-task learning model trained on only 60% of the training data achieves better performance than the single-task model trained on the complete training data on the CCKS17 and CCKS18 CNER datasets. Moreover, the effects of common NER features (i.e., word embedding, dictionary and radical features) and neural network models (i.e., CNN, BiLSTM, CNN-CRF and BiLSTM-CRF models) were investigated for the Chinese CNER task. The experimental results show that the BiLSTM-CRF model outperforms the other models. Among the additional features, the dictionary feature is the most effective. Finally, compared with other existing methods, our neural network model based on stroke ELMo and multi-task learning achieves better performance on the CCKS17 and CCKS18 CNER datasets (F-scores of 91.75% and 90.05%, respectively).
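The idea of feeding stroke sequences rather than whole characters into the language model can be illustrated with a minimal sketch. The abstract does not state which stroke inventory or encoding the authors use, so everything named here is an assumption made for illustration: the five-class stroke scheme, the toy lookup table `TOY_STROKES`, the padding and unknown ids, and the function names.

```python
# Hypothetical five-class stroke inventory (horizontal, vertical, left-falling,
# right-falling, turning). The paper's actual scheme is not given in the abstract.
STROKE_IDS = {"heng": 1, "shu": 2, "pie": 3, "na": 4, "zhe": 5}
PAD_ID = 0  # assumed padding id
UNK_ID = 6  # assumed id for characters missing from the table

# Toy stroke table covering a handful of simple characters only.
# A real system would load a full character-to-stroke dictionary.
TOY_STROKES = {
    "人": ["pie", "na"],
    "十": ["heng", "shu"],
    "大": ["heng", "pie", "na"],
    "木": ["heng", "shu", "pie", "na"],
    "口": ["shu", "zhe", "heng"],
}


def char_to_stroke_ids(ch):
    """Return the stroke-id sequence for one character, or [UNK_ID] if unknown."""
    strokes = TOY_STROKES.get(ch)
    if strokes is None:
        return [UNK_ID]
    return [STROKE_IDS[s] for s in strokes]


def sentence_to_stroke_ids(text, max_strokes):
    """Encode each character as a fixed-length stroke-id row (truncate or pad)."""
    rows = []
    for ch in text:
        ids = char_to_stroke_ids(ch)[:max_strokes]
        ids = ids + [PAD_ID] * (max_strokes - len(ids))
        rows.append(ids)
    return rows


if __name__ == "__main__":
    for ch, row in zip("十人木", sentence_to_stroke_ids("十人木", max_strokes=4)):
        print(ch, row)
```

In a setup like this, each fixed-length row would be passed to a character-level encoder (for example a CNN over stroke embeddings, as the original ELMo does over letters), so that characters sharing strokes share parameters. Whether the paper follows exactly this design is not stated in the abstract.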
Authors
罗凌
杨志豪
宋雅文
李楠
林鸿飞
LUO Ling; YANG Zhi-Hao; SONG Ya-Wen; LI Nan; LIN Hong-Fei (School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024)
Source
《计算机学报》
EI
CSCD
Peking University Core Journal
2020, Issue 10, pp. 1943-1957 (15 pages)
Chinese Journal of Computers
Funding
Supported by the National Key Research and Development Program of China under the 13th Five-Year Plan (2016YFC0901900).
Keywords
stroke ELMo
multi-task learning
neural networks
named entity recognition
Chinese electronic medical records