Abstract
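To address the lack of annotated named entity corpora for Chinese electronic medical records (CEMRs), this work studies the construction of such a corpus. With reference to the definitions of named entity types and modifier types for electronic medical records published in 2010 by the US national center Informatics for Integrating Biology and the Bedside (I2B2), detailed annotation guidelines for CEMRs were drawn up under the guidance of professional physicians. Based on an analysis of a large number of CEMRs, a complete named entity annotation scheme was proposed, and a named entity annotated corpus of a certain scale was built using pre-annotation followed by formal annotation, with annotation consistency above 92%. The work provides reliable data support for research on named entity recognition and information extraction from CEMRs, and is also significant for medical knowledge mining.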
In view of the current gap in the semantic annotation of named entities in Chinese electronic medical records (CEMRs), a study on the construction of an annotated corpus for CEMRs' named entities was conducted. With reference to the definitions of named entity types and modification types for electronic medical records given by the US Informatics for Integrating Biology and the Bedside (I2B2) in 2010, an annotation specification for CEMRs was developed under the guidance of professional doctors. Based on the analysis of a large number of CEMRs, a complete scheme for the annotation of CEMRs' named entities was proposed, and a large-scale annotated corpus for named entities of CEMRs was established by using pre-annotation and formal annotation. Its annotation consistency is over 92%. This annotated corpus can provide reliable data for research on named entity recognition and information extraction for CEMRs, and it is also valuable for medical knowledge mining.
Source
《高技术通讯》
CAS
CSCD
Peking University Core Journals (北大核心)
2015, No. 2, pp. 143-150 (8 pages)
Chinese High Technology Letters
Funding
Supported by the National Natural Science Foundation of China (60975077)
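Keywords
Chinese electronic medical record (CEMR)
named entity
annotated corpus
annotation specification
annotation consistency (inter-annotator agreement, IAA)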
Chinese electronic medical record (CEMR), named entity, annotated corpus, annotation specification, inter-annotator agreement (IAA)