Abstract
Objective To effectively identify protected health information (PHI) in clinical text, so that patient privacy is protected while clinical information can be shared and used and evidence-based clinical research can be advanced. Methods A sample of 18,350 discharge summaries was randomly drawn from a municipal regional population health information platform in Sichuan Province, China, and a conditional random field (CRF) model was used to identify multiple PHI types in the sample. Results Manual annotation yielded 32,210 PHI entities with an annotation agreement of 92.7%; after inconsistent annotations were reviewed and corrected, agreement reached 100%. In the test evaluation, every PHI type except pathology number, X-ray film number, and age over 89 years achieved an F-measure above 95%, and the overall F-measure was 98.72%. Conclusion Using machine learning on large-scale, diverse clinical text, this study achieved effective automated de-identification of clinical text. Remaining work includes developing more efficient de-identification algorithms for health big data on the basis of the protection model, and ensuring the generality and scalability of de-identification techniques.
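The per-type and overall F-measures reported above can be illustrated with a minimal sketch. The abstract does not state how the F-measure was computed, so this assumes the usual balanced F1 with exact matching on span and type, and micro-averaging for the overall score. The function name `f_measure`, the tuple layout, and the toy entities are all hypothetical and are not taken from the study.

```python
from collections import Counter


def f_measure(gold, pred):
    """Return per-type and micro-averaged (precision, recall, F1).

    gold, pred: sets of (doc_id, start, end, phi_type) tuples.
    Assumption: an entity is correct only on an exact span and type match.
    """
    tp = Counter(e[3] for e in gold & pred)   # true positives per PHI type
    n_gold = Counter(e[3] for e in gold)      # annotated entities per type
    n_pred = Counter(e[3] for e in pred)      # predicted entities per type

    def prf(t, g, p):
        precision = t / p if p else 0.0
        recall = t / g if g else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1

    types = set(n_gold) | set(n_pred)
    per_type = {k: prf(tp[k], n_gold[k], n_pred[k]) for k in types}
    overall = prf(sum(tp.values()), len(gold), len(pred))
    return per_type, overall


# Toy data (hypothetical): one NAME span is predicted with the wrong end offset.
gold = {(1, 0, 3, "NAME"), (1, 10, 14, "DATE"),
        (2, 5, 9, "NAME"), (2, 20, 31, "PHONE")}
pred = {(1, 0, 3, "NAME"), (1, 10, 14, "DATE"),
        (2, 5, 8, "NAME"), (2, 20, 31, "PHONE")}

per_type, overall = f_measure(gold, pred)
```

Under exact matching, a boundary error counts as both a missed entity and a spurious one, which is one reason sparse types can score noticeably lower than the overall figure.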
Source
《中国卫生信息管理杂志》
2017, Issue 2, pp. 217-222 (6 pages)
Chinese Journal of Health Informatics and Management
Funding
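Fundamental Research Funds for the Central Universities: Research on the Formation Mechanism of Knowledge Networks among Regional Medical Institutions (Grant No. 2015AE017)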