Abstract
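In data centers, the infrastructure of cloud computing, the Hadoop Distributed File System (HDFS) is widely used for its high fault tolerance, reliability, and scalability. However, the rack-aware storage strategy that HDFS follows does not consider differences between data items or how often they are used: all data are replicated with the same number of replicas and scattered across different DataNodes, which inevitably keeps too many DataNodes powered on and drives up data center energy consumption. To address this problem, the paper removes the existing HDFS restriction of a constant number of replicas per data block and proposes a variable-replica storage strategy that guarantees data block availability. A hypergraph model of distributed file storage is built to express mathematically the many-to-many relationships among data blocks, files, and DataNodes. Based on this model, a κ-transversal hyperedge computation method is proposed to select a variable κ-fold minimal covering set for HDFS in the data center, thereby determining the smallest set of active DataNodes that guarantees data availability and saving energy in the data center's storage units. The feasible region of the original problem can contain multiple optimal solutions; that is, under the κ-coverage constraint on data blocks, several configurations exist with the same minimum number of active DataNodes. The problem is therefore a multimodal function optimization problem, and the paper proposes a greedy firefly algorithm to solve it. The performance evaluation runs three typical workloads in a Hadoop environment (WordCount, TeraSort, and Grep) and covers data availability, HDFS cluster storage load balance, cluster energy consumption, and data center network performance. The experimental results show that, while keeping data blocks and files available, the variable-κ minimum replica covering set algorithm activates fewer DataNodes, effectively reduces HDFS cluster energy consumption, and relieves network transmission congestion through a sensible configuration of the active DataNodes.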
In data centers, the infrastructure of cloud computing, the Hadoop Distributed File System (HDFS) has been widely used for handling large amounts of data because of its excellent fault tolerance, reliability, and scalability. Large files stored in an HDFS-based data center are split into a number of small data blocks, and the default size of each data block is 64 MB. To improve the reliability of data blocks, HDFS creates multiple replicas of each data block in the data center. The replicas and the original data blocks are stored on different data nodes according to the rack-aware storage strategy. With this strategy, if a data node fails, the data hosted on that physical machine remain available because their replicas can still be retrieved from other data nodes. However, these storage systems usually adopt the same replication and storage strategy for all data to guarantee availability, i.e., they create the same number of replicas for every data set and store them randomly across data nodes. Such strategies do not fully consider that different data sets have different availability requirements. More servers than necessary must therefore be kept running to store replicas of rarely used data, which increases energy consumption. As more data centers are built around the world to provide cloud computing capacity, operators face very large electricity bills. To address this issue, this paper studies an energy-saving optimization algorithm for differentiated HDFS storage in cloud data centers. Removing the limitation of a constant number of replicas in existing storage methods, we propose a storage strategy in which the number of active replicas of each data block varies according to the user's data availability requirements.
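As a rough illustration of the arithmetic behind this motivation, the sketch below counts how many block replicas must be hosted on running data nodes under a fixed replication factor versus a per-file variable number of active replicas. The 64 MB block size comes from the abstract and 3 is the HDFS default replication factor; the file sizes, the per-file replica counts, and the function names are hypothetical. The paper's strategy works per data block, so assigning one value per file is a simplification.

```python
import math

BLOCK_SIZE_MB = 64  # default HDFS block size stated in the abstract


def num_blocks(file_size_mb: float) -> int:
    """Number of blocks a file of the given size is split into."""
    return max(1, math.ceil(file_size_mb / BLOCK_SIZE_MB))


def total_active_replicas(file_sizes_mb, replicas_per_file) -> int:
    """Total block replicas that must sit on running data nodes."""
    return sum(
        num_blocks(size) * replicas
        for size, replicas in zip(file_sizes_mb, replicas_per_file)
    )


# Hypothetical workload: three files of 200 MB, 1024 MB and 30 MB.
files_mb = [200, 1024, 30]

# Fixed strategy: every file uses the HDFS default of 3 replicas.
fixed = total_active_replicas(files_mb, [3, 3, 3])

# Variable strategy (assumed values): the large, rarely used file keeps
# only 1 active replica, the small file keeps 2.
variable = total_active_replicas(files_mb, [3, 1, 2])

print(fixed, variable)
```

Fewer active replicas do not by themselves determine how many data nodes can be switched off; that depends on where the replicas are placed, which is the covering problem the abstract goes on to describe.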
First, this paper develops a novel hypergraph-based storage model for cloud data centers that precisely represents the many-to-many relationships among files, data blocks, data racks, and data nodes. Based on this model, a κ-transversal hyperedge algorithm is proposed to compute a minimum set of data nodes that provides variable κ-coverage of the data blocks. Because only the minimum number of required data nodes is kept running, the data center saves energy while maintaining full functionality. Analysis of this optimization problem shows that the feasible region contains more than one optimal solution: several solutions have the same minimum number of active data nodes and satisfy the κ-coverage constraints on the data blocks. The problem is therefore a multimodal function optimization problem, and this paper proposes a greedy firefly algorithm to solve it. We have also implemented the proposed algorithm in an HDFS-based prototype data center and evaluated it with the WordCount, TeraSort, and Grep workloads, analyzing four aspects of the data center: data availability, load balance, energy consumption, and network performance. Experimental results show that the variable hypergraph coverage strategy not only reduces energy consumption by keeping fewer data nodes active, but also relieves delivery congestion in the data center network.
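The covering problem described above can be sketched as follows. This is a plain greedy heuristic written for illustration, not the paper's κ-transversal computation or its greedy firefly algorithm. It assumes each data block is modeled as a hyperedge, i.e. the set of data nodes holding one of its replicas, and that each block has its own required number of active replicas κ. The function name, node names, and sample data are hypothetical.

```python
def greedy_k_cover(replica_nodes, k):
    """Pick data nodes to keep active so that every block b has at least
    k[b] of its replica-holding nodes active (greedy heuristic).

    replica_nodes: dict mapping block -> set of nodes holding a replica
    k:             dict mapping block -> required number of active replicas
    """
    for block, nodes in replica_nodes.items():
        if k[block] > len(nodes):
            raise ValueError(f"block {block} has fewer replicas than k")

    remaining = dict(k)  # residual demand per block
    all_nodes = set().union(*replica_nodes.values())
    active = set()

    def gain(node):
        # Number of still-unsatisfied blocks this node would help cover.
        return sum(
            1
            for block, nodes in replica_nodes.items()
            if node in nodes and remaining[block] > 0
        )

    while any(r > 0 for r in remaining.values()):
        # Sort candidates so that ties are broken deterministically.
        best = max(sorted(all_nodes - active), key=gain)
        active.add(best)
        for block, nodes in replica_nodes.items():
            if best in nodes and remaining[block] > 0:
                remaining[block] -= 1
    return active


# Hypothetical placement: three blocks with three replicas each on four nodes.
placement = {
    "b1": {"dn1", "dn2", "dn3"},
    "b2": {"dn2", "dn3", "dn4"},
    "b3": {"dn1", "dn3", "dn4"},
}
required = {"b1": 2, "b2": 1, "b3": 1}

active_nodes = greedy_k_cover(placement, required)
print(sorted(active_nodes))
```

In this small instance both {dn1, dn3} and {dn2, dn3} satisfy the constraints with two active nodes, which illustrates the point made above that several optimal solutions with the same node count can coexist. A greedy pass returns only one of them and is not guaranteed to be optimal in general.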
Authors
杨挺
王萌
张亚健
赵英杰
盆海波
YANG Ting; WANG Meng; ZHANG Ya-Jian; ZHAO Ying-Jie; PEN Hai-Bo (School of Electrical and Information Engineering, Tianjin University, Tianjin 300072)
Source
《计算机学报》
EI
CSCD
PKU Core (Peking University Core Journals)
2019, No. 4, pp. 721-735 (15 pages)
Chinese Journal of Computers
Funding
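National Natural Science Foundation of China (61571324)
Key Project of the Natural Science Foundation of Tianjin (16JCZDJC30900)
International Science and Technology Cooperation Program of China (2013DFA11040)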