Abstract
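The text in pathological examination reports is usually unstructured, which hinders automatic analysis and processing by computers. Text structuring currently relies mainly on information and relation extraction, but the semantic peculiarities of pathological reports make relation extraction from Chinese text challenging. To address this problem, a structuring method for pathological reports is designed. First, a neural network language model is used to obtain a synonym table for the reports, merging cases where several words express one meaning. On this basis, dependency trees are generated for the report text, and pruning strategies based on short-sentence segmentation and information annotation are proposed to simplify the initially generated trees, which makes the grammatical relations clearer and improves the accuracy of the structured results. The dependency parsing results are then used to extract indices and their corresponding values from Chinese examination reports and to generate structured templates automatically. Experiments on pathological reports in real use by doctors show that the method reaches accuracies of 82.91% and 79.11% on the index-term and index-value extraction tasks, respectively, laying a foundation for related research.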
Most pathological reports are unstructured texts that cannot be analyzed directly by computers. Current research on text structuring focuses mainly on information extraction. However, the syntactic features of pathological reports are distinctive, which makes it more difficult to extract information relations. To solve this problem, a novel method for structuring pathological reports based on syntactic and semantic features is proposed in this paper. First, we construct a synonym lexicon using neural network language models to eliminate synonymy. Then dependency trees are generated from the preprocessed pathological reports to extract medical examination indices. Meanwhile, we use short-sentence segmentation and annotation as optimization strategies to simplify the structure of the dependency trees, which makes the grammatical relations of medical texts clearer and improves the quality of the structured results. Finally, key-value pairs of medical examination indices are extracted from pathological reports in Chinese, and the structured texts are generated automatically. Experimental results on real pathological report data sets show that the proposed method achieves accuracies of 82.91% and 79.11% on medical index extraction and value extraction, respectively, which provides a solid foundation for related future studies.
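The pipeline outlined in the abstract (synonym merging, short-sentence segmentation, index-value extraction from a dependency tree) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the synonym table is hard-coded instead of being learned by a neural network language model, the dependency parse is written by hand instead of coming from a parser, and the labels `SBV`/`HED` and the pairing rule are hypothetical stand-ins, not the rules used in the paper.

```python
import re
from dataclasses import dataclass

# Hypothetical synonym table. The paper derives its table from a neural
# network language model; that step is not reproduced here.
SYNONYMS = {"tumour": "tumor", "neoplasm": "tumor"}


@dataclass
class Token:
    idx: int   # 1-based position of the token in its clause
    text: str  # surface form
    head: int  # idx of the governing token, 0 for the root
    rel: str   # dependency label (illustrative label set)


def normalize(word):
    """Map a word to its canonical synonym, if one is listed."""
    return SYNONYMS.get(word, word)


def split_clauses(report):
    """Cut a report into short clauses at commas, semicolons, and 。."""
    return [c.strip() for c in re.split(r"[,,;;。]", report) if c.strip()]


def extract_pairs(tokens):
    """Pair each subject token (assumed index) with its governor (assumed value)."""
    by_idx = {t.idx: t for t in tokens}
    pairs = []
    for t in tokens:
        if t.rel == "SBV" and t.head in by_idx:
            pairs.append((normalize(t.text), by_idx[t.head].text))
    return pairs


# Hand-written parse of the clause "neoplasm firm": "firm" is the root and
# "neoplasm" is its subject.
clause = [Token(1, "neoplasm", 2, "SBV"), Token(2, "firm", 0, "HED")]
pairs = extract_pairs(clause)
```

Splitting into short clauses before parsing keeps each tree small, which is the intuition behind the simplification step mentioned in the abstract; a real system would feed each clause to a dependency parser in place of the hand-built `clause` list.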
Authors
田驰远
陈德华
王梅
乐嘉锦
Tian Chiyuan; Chen Dehua; Wang Mei; Le Jiajin (College of Computer Science and Technology, Donghua University, Shanghai 201620)
Source
《计算机研究与发展》
EI
CSCD
Peking University Core Journal
2016, No. 12, pp. 2669-2680 (12 pages)
Journal of Computer Research and Development
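Funding
Shanghai Science and Technology Innovation Action Plan Project (15511106900)
Shanghai Science and Technology Development Fund Project (16JC1400802)
Fundamental Research Funds for the Central Universities, Donghua University Lizhi Program (B201312)
Shanghai Special Fund for Informatization Development (XX-XXFZ-01-14-6349)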
Keywords
medical data
pathological reports
dependency parsing
text structured processing
neural network language model