Abstract
Fuzzy word boundaries and word-segmentation ambiguity in Chinese medical literature make it difficult for traditional deep learning methods to capture lexical semantic information. To address this, a Chinese medical named entity recognition model that fuses character and word features at the embedding layer is proposed. First, because word vectors lack boundary features, the word vectors are concatenated with part-of-speech and word-boundary features, and an attention mechanism is used to capture features such as the latent dependency weights between characters, yielding enhanced word vectors. Second, the character vectors obtained from the BERT model are concatenated with the enhanced word vectors to form the embedding, from which a BiLSTM model extracts contextual semantic features. Finally, a CRF model decodes the label sequence. The model was evaluated on the annotated diabetes dataset of the Standardized Metabolic Disease Management Center (MMC) of Ruijin Hospital and achieved good results.
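The two ends of the pipeline described in the abstract, feature concatenation at the embedding layer and CRF-style sequence decoding, can be illustrated with a minimal sketch. This is not the authors' implementation: the tag sets, vector sizes, scores, and the function names `fuse_features` and `viterbi_decode` are assumptions made for illustration, and the attention, BERT, and BiLSTM stages are omitted entirely.

```python
from typing import Dict, List, Sequence

# Hypothetical tag sets, chosen only for illustration.
BOUNDARY_TAGS = ["B", "M", "E", "S"]  # position of a character inside its word
POS_TAGS = ["n", "v", "other"]        # coarse part-of-speech labels


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """Return a one-hot list marking where `value` sits in `vocabulary`."""
    return [1.0 if item == value else 0.0 for item in vocabulary]


def fuse_features(char_vec: Sequence[float], word_vec: Sequence[float],
                  pos: str, boundary: str) -> List[float]:
    """Concatenate a character vector with a word vector that has been
    extended by part-of-speech and word-boundary one-hot features."""
    return (list(char_vec) + list(word_vec)
            + one_hot(pos, POS_TAGS) + one_hot(boundary, BOUNDARY_TAGS))


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[str, Dict[str, float]],
                   labels: Sequence[str]) -> List[str]:
    """Return the highest-scoring label path.

    emissions[t][label] is the score of `label` at position t;
    transitions[prev][cur] is the score of moving from `prev` to `cur`.
    """
    if not emissions:
        return []
    # best[t][label]: best score of any path that ends in `label` at position t
    best = [{lab: emissions[0][lab] for lab in labels}]
    back = []  # back[t - 1][label]: best predecessor of `label` at position t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy usage: a strongly negative O -> I transition forbids an entity
# continuation that has no beginning.
LABELS = ["O", "B", "I"]
TRANSITIONS = {p: {c: 0.0 for c in LABELS} for p in LABELS}
TRANSITIONS["O"]["I"] = -100.0
EMISSIONS = [
    {"O": 1.0, "B": 0.2, "I": 0.0},
    {"O": 0.0, "B": 0.0, "I": 1.5},
]
decoded = viterbi_decode(EMISSIONS, TRANSITIONS, LABELS)
fused = fuse_features([0.1, 0.2], [0.3], "n", "B")
```

In the toy usage, the per-position scores alone would favour the sequence O, I, but the transition penalty steers the decoder to B, I instead. Learned transition scores of this kind are a common reason for placing a CRF layer after a BiLSTM.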
Authors
ZHANG Hou-chang (张厚昌)
LIU Cheng-liang (刘成良)
School of Mechanical and Power Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Source
Chinese Journal of Medical Library and Information Science (《中华医学图书情报杂志》)
CAS
2021, Issue 9, pp. 42-49 (8 pages)
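Funding
National Key R&D Program of China project "Assistive Robot Technology and Systems for Semi-Disabled Elderly People" (2018YFB1307005)
Shanghai Municipal Health and Family Planning Commission Smart Healthcare project "Artificial Intelligence-Based Arrhythmia Monitoring and Big Data Analysis" (2018ZHYL0226).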