
De-identification of Clinical Texts Based on Conditional Random Fields (cited by: 3)
Abstract Objective: To identify private information in clinical texts effectively, so that patient privacy is protected while clinical information can be shared and reused, supporting the development of evidence-based clinical research. Methods: A sample of 18,350 discharge summaries was randomly drawn from a municipal-level regional population health information platform in Sichuan Province, China. A conditional random field (CRF) model was used to identify multiple types of protected health information (PHI) in the sample. Results: Manual annotation yielded 32,210 PHI entities in total. Inter-annotator agreement was 92.7%; after the inconsistent annotations were reviewed and corrected, agreement reached 100%. On the test set, every PHI type except pathology numbers, X-ray film numbers, and ages over 89 achieved an F-measure above 95%, and the overall F-measure was 98.72%. Conclusion: Using machine learning on large-scale, diverse clinical text data, this study achieved efficient automated de-identification of clinical texts. Two directions remain to be explored: developing more efficient de-identification algorithms for health big data on the basis of the protection model, and ensuring the generality and scalability of de-identification techniques.
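Source: Chinese Journal of Health Informatics and Management (《中国卫生信息管理杂志》), 2017, No. 2, pp. 217-222 (6 pages)
Funding: Fundamental Research Funds for the Central Universities, project "Research on the Formation Mechanism of Knowledge Networks among Regional Medical Institutions" (Project No. 2015AE017)
Keywords: de-identification; clinical text; PHI; CRF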
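
The abstract reports per-type and overall F-measures for PHI recognition but does not spell out how they were computed. The sketch below shows one common way such a score is obtained for a sequence labeler such as a CRF: decode BIO tags into entity spans, count exact matches, and combine precision and recall. It is an illustration under stated assumptions, not the authors' evaluation code: the BIO scheme, the exact-match criterion, the balanced F1 definition, and the tag names `NAME`, `AGE`, and `DATE` are all hypothetical choices made here.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, PHI type); the type names are hypothetical
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Decode a BIO tag sequence into a set of entity spans.

    An I- tag that does not continue an open entity of the same type is ignored.
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))  # close the entity that was open
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level scores; a prediction counts only on an exact span and type match."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: the predicted AGE entity is one token too short, so it does not match.
gold = ["B-NAME", "I-NAME", "O", "B-AGE", "I-AGE", "O", "B-DATE"]
pred = ["B-NAME", "I-NAME", "O", "B-AGE", "O", "O", "B-DATE"]
precision, recall, f1 = precision_recall_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under exact matching, a boundary error counts as both a missed gold entity and a spurious prediction, which is one plausible reason why short, irregular identifiers can score lower than other PHI types.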
