Abstract
Objective To propose a conditional random field (CRF) model based on Re-entity, a new word segmentation method, and to compare it with the bi-directional long short-term memory neural network (BiLSTM)-CRF and Lattice-long short-term memory neural network (Lattice-LSTM) models. Methods After comparing existing entity recognition methods and models, we proposed Re-entity-based CRF, BiLSTM-CRF and Lattice-LSTM methods for task one of the 2018 China Conference on Knowledge Graph and Semantic Computing (CCKS2018), "Chinese clinical named entity recognition", and trained character vector sets at different parameter levels on different corpora. Each method was introduced into the neural network models for comparative experiments on model performance. Finally, a comparative study was carried out on input length at the sentence level and the document level. Results With the best feature engineering, introducing the Re-entity method improved the performance of the CRF model. The sentence-level Lattice-LSTM model achieved a strict F1-measure of 89.75% on this task, higher than the best result on CCKS2018 task one (89.25%). Conclusion The CRF model based on Re-entity can use a Chinese clinical drug knowledge base to effectively improve the recognition rate of drugs in electronic medical records. The Re-entity method can reduce the error accumulation caused by word segmentation in data preprocessing. The Lattice structure can better combine the latent semantic information of character and word sequences. In addition, sentence-level input can effectively improve the recognition accuracy of neural network models.
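The strict F1-measure reported above is usually understood to credit a predicted entity only when both its boundaries and its type exactly match a gold annotation. The following is a minimal sketch of that scoring rule, not the official CCKS2018 scorer. The `(start, end, type)` tuple layout, the function name `strict_prf`, and the sample entities are illustrative assumptions.

```python
from typing import Set, Tuple

# Hypothetical entity representation: (start offset, end offset, entity type).
Entity = Tuple[int, int, str]


def strict_prf(gold: Set[Entity], pred: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match scoring.

    A prediction is a true positive only if its start, end and type
    are all identical to those of a gold entity.
    """
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations for one record (not taken from the CCKS2018 data).
gold = {(0, 3, "drug"), (10, 14, "symptom"), (20, 25, "anatomy")}
pred = {
    (0, 3, "drug"),        # exact match
    (10, 13, "symptom"),   # boundary off by one: no credit under strict scoring
    (20, 25, "anatomy"),   # exact match
    (30, 32, "drug"),      # spurious prediction
}

p, r, f = strict_prf(gold, pred)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

In this toy case two of the four predictions match exactly, so the near-miss on the symptom span lowers both precision and recall. This is why segmentation errors that shift entity boundaries are costly under strict evaluation.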
Authors
潘璀然
王青华
汤步洲
姜磊
黄勋
王理
PAN Cui-ran; WANG Qing-hua; TANG Bu-zhou; JIANG Lei; HUANG Xun; WANG Li (Department of Medical Informatics, School of Medicine, Nantong University, Nantong 226001, Jiangsu, China; College of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Shenzhen 518055, Guangdong, China; Department of Rheumatology and Immunology, Changzheng Hospital, Naval Medical University (Second Military Medical University), Shanghai 200433, China; Department of Communication Engineering, School of Information Science and Technology, Nantong University, Nantong 226001, Jiangsu, China)
Source
《第二军医大学学报》
CAS
CSCD
PKU Core (北大核心)
2019, Issue 5, pp. 497-506 (10 pages)
Academic Journal of Second Military Medical University
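Funding
National Key Research and Development Program of China (2018YFC0116902)
National Natural Science Foundation of China (81873915)
Postgraduate Research and Practice Innovation Program of Jiangsu Province (KYCX17-1932)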
Keywords
computerized medical records systems
Chinese electronic medical record
entity recognition
conditional random field
bi-directional long short-term memory neural network
lattice-long short-term memory neural network