Abstract
As more hospitals digitize and regional medical applications expand, unstructured and semi-structured medical data are growing explosively, and traditional technical architectures are increasingly unable to handle such volumes. The Shenzhen regional health information data exchange platform covers 60 public hospitals and more than 600 community health institutions across the city. It connects nearly 50 heterogeneous systems, holds more than 17 million health records and over 3 billion diagnosis and treatment records, and generates more than 5 million small files per day on average. To address the low efficiency of exchanging and processing massive numbers of small files in Shenzhen's regional health informatization, this paper proposes a Hadoop-based solution that uses time-baseline archive files and sequence files to improve the storage and retrieval efficiency of small files. Practical verification shows that the approach meets the data exchange needs of real business applications. The paper describes the implementation in detail, including setting the time baseline according to the scale of business data, customizing data types and data structures according to business requirements, merging small files and storing them in blocks, building a mapping from small files to large files, and the associated data exchange processing flow. An evaluation on real data shows that, compared with conventional techniques, the approach clearly improves the efficiency of batch processing of small files.
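The merge-and-map idea summarized in the abstract can be illustrated with a minimal in-memory sketch. It is not the paper's implementation and does not use any Hadoop API: the one-day baseline granularity, the function names (`baseline_key`, `pack_small_files`, `read_small_file`), and the `(baseline, offset, length)` index layout are all assumptions made for illustration.

```python
import io
from collections import defaultdict
from datetime import datetime


def baseline_key(timestamp: datetime) -> str:
    """Bucket a timestamp by calendar day (the granularity is an assumption)."""
    return timestamp.strftime("%Y-%m-%d")


def pack_small_files(files):
    """Merge (name, timestamp, payload) records into one blob per baseline.

    Returns (containers, index), where containers maps a baseline key to the
    merged bytes and index maps a file name to (baseline, offset, length).
    """
    buffers = defaultdict(io.BytesIO)
    index = {}
    for name, timestamp, payload in files:
        key = baseline_key(timestamp)
        buf = buffers[key]
        offset = buf.tell()  # current end of the merged blob
        buf.write(payload)
        index[name] = (key, offset, len(payload))
    containers = {key: buf.getvalue() for key, buf in buffers.items()}
    return containers, index


def read_small_file(containers, index, name):
    """Recover one small file from its container through the mapping."""
    key, offset, length = index[name]
    return containers[key][offset:offset + length]
```

In this sketch a lookup touches only the index and one merged blob, which is the property that lets a large-file store avoid per-file overhead; a real deployment would persist both the containers and the index rather than hold them in memory.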
Source
China Digital Medicine (《中国数字医学》)
2014, Issue 8, pp. 89-92 (4 pages)
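Funding
Research on Real-Time Interaction and Efficient Analysis and Processing Technologies for Massive Regional Health Medical Data (No. CXZZ20120828161054317)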
Keywords
medical data
time baseline
massive small files
data integration technology