Abstract
Clinical records of Traditional Chinese Medicine (TCM) are an important data resource for TCM research, but they are still written mainly as free text. In-depth analysis of record data therefore requires structured information extraction, of which named entity extraction is the foundational step. To extract named entities such as symptoms, diseases, and inducing factors from TCM clinical records, this paper applies Conditional Random Field (CRF), Hidden Markov Model (HMM), and Maximum Entropy Markov Model (MEMM) to 413 manually annotated records, using Chinese characters as features together with four types of feature templates, and compares the results. With appropriate feature templates, CRF performs well, reaching F1 values of 0.80 for symptoms, 0.74 for disease names, and 0.74 for inducing factors. Compared with HMM and MEMM, CRF achieves the highest precision and recall, which suggests preliminarily that it is a suitable method for named entity extraction from TCM clinical records.
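The abstract does not spell out the four feature templates, the tagging scheme, or how F1 was computed, so the following is only a minimal sketch under stated assumptions. It shows one common way to build character-level window features for a sequence labeler such as a CRF, and to compute entity-level precision, recall, and F1 from BIO tags. The window size, the feature names, the BIO scheme, and the `SYM` label are all hypothetical choices made for illustration, not the authors' setup.

```python
from typing import Dict, Iterable, List, Set, Tuple


def char_features(chars: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Features for position i: characters in a +/- window plus adjacent bigrams.

    The window size and feature names are assumptions; the paper's actual
    templates are not given in the abstract.
    """
    n = len(chars)
    feats = {"bias": "1"}
    for off in range(-window, window + 1):
        j = i + off
        feats[f"c[{off}]"] = chars[j] if 0 <= j < n else "<PAD>"
    if i > 0:
        feats["c[-1]c[0]"] = chars[i - 1] + chars[i]
    if i < n - 1:
        feats["c[0]c[1]"] = chars[i] + chars[i + 1]
    return feats


def sentence_features(text: str, window: int = 2) -> List[Dict[str, str]]:
    """One feature dict per Chinese character (no word segmentation)."""
    chars = list(text)
    return [char_features(chars, i, window) for i in range(len(chars))]


def bio_spans(tags: Iterable[str]) -> Set[Tuple[str, int, int]]:
    """Convert BIO tags into a set of (entity_type, start, end_exclusive) spans."""
    spans: Set[Tuple[str, int, int]] = set()
    start, kind = None, None
    # A trailing "O" sentinel closes any entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and kind == tag[2:]
        if not inside:
            if kind is not None:
                spans.add((kind, start, i))
                start, kind = None, None
            if tag.startswith("B-"):
                start, kind = i, tag[2:]
    return spans


def precision_recall_f1(gold_tags: Iterable[str],
                        pred_tags: Iterable[str]) -> Tuple[float, float, float]:
    """Entity-level scores: a predicted span counts only on an exact match."""
    gold, pred = bio_spans(gold_tags), bio_spans(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

For a made-up fragment such as `sentence_features("头痛三天")`, each of the four characters receives its own feature dictionary. A list of such dictionaries, paired with a list of tags per sentence, is the input shape that dictionary-based CRF toolkits typically accept. Working at the character level avoids depending on a word segmenter, which matches the abstract's statement that Chinese characters were used as features.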
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
2014, No. 9, pp. 312-316 (5 pages)
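Funding
National Natural Science Foundation of China (61105055, 81230086)
National High Technology Research and Development Program of China ("863" Program) (2012AA02A609)
Fundamental Research Funds for the Central Universities (K13JB00140)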
Keywords
Traditional Chinese Medicine (TCM) medical records
named entity extraction
corpus annotation system
Conditional Random Field (CRF)
feature template