Abstract
Concept normalization is the task of linking terms in medical text to their concepts in a terminology such as the Unified Medical Language System (UMLS). Traditional medical concept normalization methods depend heavily on data coverage, which puts languages other than English at a disadvantage. This study focuses on entity mentions that have already been extracted and labeled by a named entity recognition system, and classifies each mention with a UMLS Concept Unique Identifier (CUI). Without translating the text, it exploits the available terminology and the multilingual nature of embedding models to improve concept normalization for Chinese. Building a medical concept normalization system for Chinese corpora is treated as a multi-class classification problem: terms are encoded using their context information and classified with cosine similarity and a softmax function. Experiments on the Yidu Cloud structured 4K dataset show that the method achieves good results even without labeled data and outperforms existing supervised methods when labeled data is available. Training the system on both Chinese and English terminology greatly improves its performance on Chinese benchmark tasks. Because documents annotated with concepts are not required, the distant supervision method used in the experiments can be applied to any type of document in the medical domain. The work offers a simpler and more effective multilingual approach to processing medical text in non-English languages. Integrating a large body of medical expertise and medical terminology into UMLS would expand terminology coverage, advance medical concept normalization, and improve the efficiency and quality of clinical research.
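The classification step described in the abstract (cosine similarity followed by a softmax over candidate concepts) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the placeholder identifiers CUI_A, CUI_B and CUI_C, the hand-made three-dimensional vectors, and the temperature parameter are all assumptions made for illustration. In a real system the vectors would come from a multilingual embedding model applied to each term and its context.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalize_mention(
    mention_vec: Vector,
    concept_vecs: Dict[str, Vector],
    temperature: float = 0.1,  # assumed scaling factor, not taken from the paper
) -> List[Tuple[str, float]]:
    """Rank candidate concepts for one mention, most probable first."""
    cuis = list(concept_vecs)
    scaled = [cosine(mention_vec, concept_vecs[c]) / temperature for c in cuis]
    probs = softmax(scaled)
    return sorted(zip(cuis, probs), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical concept embeddings keyed by placeholder identifiers.
    concepts = {
        "CUI_A": [1.0, 0.0, 0.0],
        "CUI_B": [0.0, 1.0, 0.0],
        "CUI_C": [0.0, 0.0, 1.0],
    }
    # Hypothetical embedding of a mention in its context.
    mention = [0.9, 0.1, 0.0]
    for cui, prob in normalize_mention(mention, concepts):
        print(cui, round(prob, 4))
```

Because the scores are cosine similarities, the ranking depends only on the direction of the embeddings, not their length. The softmax turns the similarities into a probability distribution over candidate concepts, which is what lets the task be treated as multi-class classification.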
Authors
易晓宇
易绵竹
YI Xiaoyu; YI Mianzhu (Luoyang Campus of Information Engineering University, Luoyang 471000, China)
Source
《河南科技学院学报(自然科学版)》
2022, No. 2, pp. 70-76 (7 pages)
Journal of Henan Institute of Science and Technology (Natural Science Edition)
Funding
National Defense Science and Technology Innovation Special Zone Project (18-H863-01-ZT-005-005-01).
Keywords
natural language processing
information extraction
medical concept normalization
multilingual representation