Abstract
Narrative clinical documents hold a wealth of valuable information, and indexing them with the terms and concepts of a biomedical terminology system supports information retrieval and mining. Such work has been under way internationally for many years, but concept indexing of Chinese medical record documents remains unexplored. This study integrated the existing Chinese version of the International Classification of Diseases (ICD) into the Unified Medical Language System (UMLS). Taking the particular characteristics of Chinese language processing into account and drawing on a statistical analysis of Chinese electronic medical record documents, it proposes a set of term extraction and negation detection methods that can be used to build a concept index for Chinese medical record documents. The term extraction stage uses the highly sensitive Reverse Maximum Matching (RMM) method, combined with a general-purpose Chinese word segmentation tool to control false positives. The negation detection stage exploits features of Chinese and existing Chinese processing techniques in a simplified clause pattern matching method. Both algorithms were validated on two clinical text data sets. The term extraction algorithm achieved sensitivities of 99.51% and 100%, with false detection rates of 1.46% and 1.66%. The negation detection algorithm achieved a positive predictive value of 100% on both data sets and negative predictive values of 100% and 98.99%; apart from document-quality problems such as nonstandard punctuation, negation was detected essentially correctly.
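The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon, the negation cue list, and the function names are hypothetical stand-ins, the real system draws its terms from ICD and UMLS, and the segmentation-based false-positive filter is omitted here. The sketch assumes that a concept is negated when a cue word precedes it within the same punctuation-delimited clause.

```python
import re

# Hypothetical toy lexicon (term -> concept label); not real ICD/UMLS content.
LEXICON = {
    "高血压": "hypertension",
    "糖尿病": "diabetes mellitus",
    "发热": "fever",
    "咳嗽": "cough",
}
MAX_LEN = max(len(term) for term in LEXICON)

# Hypothetical negation cues; the abstract does not list the actual patterns.
# A single-character cue such as "无" is a simplification and can over-match.
NEGATION_CUES = ("否认", "未见", "无")


def reverse_maximum_match(text):
    """Scan right to left, taking the longest lexicon term that ends at the cursor."""
    hits = []
    end = len(text)
    while end > 0:
        for size in range(min(MAX_LEN, end), 0, -1):
            candidate = text[end - size:end]
            if candidate in LEXICON:
                hits.append((end - size, end, candidate))
                end -= size
                break
        else:
            end -= 1  # no term ends here; move one character to the left
    hits.reverse()  # restore left-to-right order
    return hits


def split_clauses(text):
    """Split on clause-level punctuation; the enumeration comma "、" is kept inside a clause."""
    return [c for c in re.split(r"[,,。;;!?!?]", text) if c]


def index_concepts(text):
    """Return (term, concept, negated) for every lexicon term found in the text."""
    results = []
    for clause in split_clauses(text):
        for start, _end, term in reverse_maximum_match(clause):
            prefix = clause[:start]
            negated = any(cue in prefix for cue in NEGATION_CUES)
            results.append((term, LEXICON[term], negated))
    return results


if __name__ == "__main__":
    for row in index_concepts("患者发热3天,伴咳嗽。否认高血压、糖尿病病史。"):
        print(row)
```

Because clauses are cut at punctuation, a missing or misplaced comma changes the scope of a negation cue, which is consistent with the abstract's remark that nonstandard punctuation accounts for the remaining errors.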
Source
《中国生物医学工程学报》
CAS
CSCD
PKU Core (Peking University core journal list)
2008, No. 5, pp. 716-721, 734 (7 pages in total)
Chinese Journal of Biomedical Engineering
Funding
National High-Tech R&D Program of China (863 Program) (2006AA02Z348)
Keywords
medical language processing
term extraction
negation detection