Abstract
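[Objective] This study decomposes neural-network-based named entity recognition models for Chinese medical texts and examines how the basic neural network modules in the symbolic representation layer and the context encoding layer, as well as the coordinated combination of multiple modules, affect entity recognition performance. [Methods] Using the benchmark corpora released for Chinese medical named entity recognition tasks, including CCKS2017, CCKS2019, and IMCS-NER, we compare the performance of models whose symbolic representation layer and context encoding layer use different neural network modules. On this basis, we build entity recognition models that combine multiple neural network modules by ensembling, in parallel, and in series, and compare their performance. [Results] Using pretrained language models such as hfl/chinese-macbert-base, hfl/chinese-roberta-wwm-ext, and hfl/chinese-bert-wwm-ext in the symbolic representation layer significantly improves recognition performance, with average F1 scores of 0.8816, 0.8816, and 0.8812, respectively. Fusing neural network modules in the context encoding layer also improves recognition performance. Among the combinations, the ensemble-based network performs best, with F1 scores of 0.9330, 0.8211, and 0.9181, respectively. [Limitations] The experiments use only Chinese medical text corpora, so the conclusions remain to be verified on corpora in other languages. [Conclusions] The type of basic neural network module and the way multiple modules work together significantly affect neural network performance on named entity recognition for Chinese medical texts.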
[Objective] This paper decomposes neural-network-based named entity recognition models for Chinese medical texts. We investigate how individual neural network modules, and the collaboration of multiple modules, affect entity recognition performance. [Methods] First, we chose the benchmark datasets from CCKS2017, CCKS2019, and IMCS-NER for named entity recognition tasks. Second, we conducted extensive experiments to compare the performance of different single modules in the symbolic representation layer and the context encoding layer. Third, we built and compared entity recognition models based on ensemble, parallel, and serial combinations of neural modules. [Results] Using hfl/chinese-macbert-base, hfl/chinese-roberta-wwm-ext, and hfl/chinese-bert-wwm-ext in the symbolic representation layer significantly improved the performance of the entity recognition models; the average F1-scores reached 0.8816, 0.8816, and 0.8812, respectively. Fusing neural modules at the context encoding layer also improved performance. Ensembled neural networks achieved the best performance, with F1-scores of 0.9330, 0.8211, and 0.9181, respectively. [Limitations] More research is needed to examine our findings on datasets in other languages. [Conclusions] The type of individual neural modules and the way they collaborate significantly affect the performance of named entity recognition for Chinese medical texts.
Authors
段宇锋
贺国秀
Duan Yufeng; He Guoxiu (Faculty of Economics and Management, East China Normal University, Shanghai 200062, China)
Source
《数据分析与知识发现》
CSCD
Peking University Core Journals (北大核心)
2023, Issue 2, pp. 26-37 (12 pages)
Data Analysis and Knowledge Discovery
Funding
This work is one of the research outcomes of a National Social Science Fund of China project (Grant No. 20BTQ092).
Keywords
Named Entity Recognition
Neural Network
Module Decomposition
Chinese Medical Text