Abstract
With the rapid development of the Internet and the explosive growth of data, Internet applications have come to rely heavily on large-scale distributed storage systems. Such applications usually store several types of data, which can be classified as hot or cold according to how frequently they are accessed. However, existing erasure-coded storage systems are generally implemented with a single fixed coding mechanism, which cannot adapt to the diverse types of data coexisting in the same system; as a result, system performance (for example, data access time) degrades. Taking the difference between hot and cold data into account, this paper proposes a storage system framework based on multiple coding mechanisms. For cold data, the framework can adopt a low-redundancy code to improve space utilization; for hot data, it can adopt a code that decodes quickly to reduce data access time. The framework is designed on top of HDFS-RAID, implemented as a real system, and deployed in a Hadoop cluster. Experiments driven by a user data access trace from a real system were run on this cluster. The results show that the framework supports efficient access to different types of data at the same time and is highly extensible with respect to coding mechanisms.
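The hot/cold selection idea described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `CodePolicy` type, the `choose_policies` helper, the access-count threshold, and the (k, m) parameters are hypothetical and are not taken from the paper or from the HDFS-RAID API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CodePolicy:
    """Hypothetical description of an erasure code: k data blocks, m parity blocks."""
    name: str
    data_blocks: int
    parity_blocks: int

    @property
    def storage_overhead(self) -> float:
        # Stored bytes divided by original bytes: (k + m) / k.
        return (self.data_blocks + self.parity_blocks) / self.data_blocks


# Placeholder policies; the paper's actual codes and parameters are not given here.
COLD_POLICY = CodePolicy("low-redundancy", data_blocks=10, parity_blocks=2)
HOT_POLICY = CodePolicy("fast-decode", data_blocks=4, parity_blocks=2)

# Hypothetical threshold: files accessed at least this often count as hot.
HOT_THRESHOLD = 3


def choose_policies(trace, threshold=HOT_THRESHOLD):
    """Map each file path in an access trace to a coding policy by access count."""
    counts = Counter(trace)
    return {
        path: HOT_POLICY if hits >= threshold else COLD_POLICY
        for path, hits in counts.items()
    }


# Toy access trace (made-up paths), one entry per access.
trace = ["/logs/a", "/img/b", "/img/b", "/img/b", "/logs/c"]
policies = choose_policies(trace)
for path, policy in sorted(policies.items()):
    print(path, policy.name, policy.storage_overhead)
```

In this sketch the frequently accessed file is assigned the fast-decode policy and pays a higher storage overhead, while rarely accessed files get the low-redundancy policy, mirroring the trade-off the abstract describes.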
Source
Computer Applications and Software (《计算机应用与软件》)
2017, No. 2, pp. 35-41 (7 pages)
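Funding
National Natural Science Foundation of China (61571136)
Shanghai "Science and Technology Innovation Action Plan" project (14511101000)
Open Research Fund of the State Key Laboratory of Integrated Services Networks (ISN15-08)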