Abstract
The Hadoop Distributed File System (HDFS) is the primary storage system for massive data. Its default storage strategy keeps a fixed number of data replicas and selects remote nodes at random to ensure data locality and reliability. However, when a failure forces the system to recover data, this default strategy wastes recovery time and unbalances the storage load across nodes. This paper proposes an improved HDFS storage strategy: a probability model built from node failure rates and the expected data availability is used to optimize the number of replicas, and remote nodes are chosen for replica placement according to a node evaluation coefficient. Experimental results show that the strategy improves the system's storage performance on massive data.
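To make the replica-count model concrete, the sketch below assumes the simplest reading of the abstract: node failures are independent, each node fails with probability p, so r replicas yield availability 1 - p^r, and the smallest r satisfying 1 - p^r >= A is ceil(log(1 - A) / log(p)). The class and parameter names are illustrative, not taken from the paper.

```java
// Minimal sketch of a replica-count model driven by node failure rate and
// target availability. Assumption: independent node failures; with per-node
// failure probability p, r replicas give availability 1 - p^r.
public class ReplicaCountModel {

    /**
     * Smallest replica count r such that 1 - p^r >= A.
     *
     * @param nodeFailureRate    per-node failure probability p, 0 < p < 1
     * @param targetAvailability expected data availability A, 0 < A < 1
     */
    static int minReplicas(double nodeFailureRate, double targetAvailability) {
        double r = Math.log(1.0 - targetAvailability) / Math.log(nodeFailureRate);
        // Round up to guarantee the availability target; keep at least one copy.
        return Math.max(1, (int) Math.ceil(r));
    }

    public static void main(String[] args) {
        // Example: 5% node failure rate and a 99.999% availability target
        // require ceil(log(1e-5) / log(0.05)) = 4 replicas.
        System.out.println(minReplicas(0.05, 0.99999));
    }
}
```

Under this model, the replica count grows only logarithmically with the availability target, which is why an optimized count can undercut a fixed default (such as HDFS's three replicas) on reliable nodes and exceed it on failure-prone ones.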
Funding
Guangzhou Science and Technology Project (No. 2014XYD-007)