Abstract
[Objective] This paper uses transfer learning and multi-task learning to address the cold-start and boundary-detection problems in Chinese medical literature entity recognition and to further improve recognition accuracy. [Methods] We propose an entity recognition method for Chinese medical literature based on transfer learning and multi-task learning. First, we construct a hybrid deep learning model, BERT-BiLSTM-IDCNN-CRF, for medical literature entity recognition. Second, we enrich medical semantic features through instance transfer, model transfer, and feature transfer. Third, we use multi-task learning to construct a coarse-grained three-class classification task that helps the entity recognition task make effective use of entity boundary information. Finally, we introduce a self-attention mechanism and a Highway network to capture globally important information and optimize the training of the deep network, yielding the TLMT-BBIC-HS model. [Results] The TLMT-BBIC-HS model achieves an F1 score of 92.98% on a Chinese diabetes medical literature dataset, which is 15.99 and 16.44 percentage points higher than the baseline models BERT-BiLSTM-CRF and BERT-IDCNN-CRF, respectively. [Limitations] The domain adaptability of the model has not been verified. [Conclusions] The TLMT-BBIC-HS model enables the transfer and sharing of medical knowledge and is better suited to Chinese medical literature entity recognition. It can provide effective support for medical and health information extraction and for the construction of knowledge graphs and question answering systems.
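The abstract does not specify how the coarse-grained three-class auxiliary task is labeled. As a minimal sketch, assuming the three classes are the untyped boundary labels B, I, and O obtained by dropping the entity type from standard BIO tags, the auxiliary labels could be derived as follows. The function name `to_coarse_labels` and the entity types `Disease` and `Drug` are hypothetical and are not taken from the paper.

```python
from typing import List


def to_coarse_labels(tags: List[str]) -> List[str]:
    """Collapse typed BIO tags (e.g. 'B-Drug') into untyped 'B', 'I', 'O'.

    Assumption: the auxiliary task only needs boundary information,
    so the entity type after the hyphen is discarded.
    """
    coarse = []
    for tag in tags:
        if tag == "O":
            coarse.append("O")
            continue
        prefix = tag.split("-", 1)[0]
        if prefix not in ("B", "I"):
            raise ValueError(f"unexpected tag: {tag!r}")
        coarse.append(prefix)
    return coarse


# Hypothetical fine-grained tag sequence for one sentence.
fine = ["B-Disease", "I-Disease", "O", "B-Drug", "I-Drug", "I-Drug", "O"]
coarse = to_coarse_labels(fine)
print(coarse)
```

In a multi-task setup of this kind, the fine-grained tags would supervise the main entity recognition output and the coarse labels would supervise a second output that shares the same encoder. How the two losses are combined is not stated in the abstract.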
Authors
韩普
顾亮
叶东宇
陈文祺
Han Pu; Gu Liang; Ye Dongyu; Chen Wenqi (School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China; Jiangsu Provincial Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China)
Source
《数据分析与知识发现》
CSSCI
CSCD
PKU Core Journal (北大核心)
2023, Issue 9, pp. 136-145 (10 pages)
Data Analysis and Knowledge Discovery
Funding
This work is one of the research outcomes of a National Social Science Fund of China project (Grant No. 22BTQ096).