Abstract
The Hadoop Distributed File System (HDFS) is a low-cost, highly fault-tolerant distributed file system designed to run on commodity hardware. It provides high-throughput data access and suits applications that work on large datasets. However, HDFS still faces performance optimization problems, such as insufficient load balancing. Hadoop ships with a load balancer that can rebalance the cluster, but it requires the user to supply a static threshold in advance. To address the fixed and subjective nature of this threshold, this paper analyzes, evaluates, and optimizes parameters including disk space utilization, CPU utilization, memory utilization, disk I/O utilization, and network bandwidth utilization, and derives an expression for computing the threshold. The threshold calculation and the resulting load balancing are verified through theoretical analysis and simulation experiments. The experimental results show that, compared with Hadoop's algorithm based on a static input threshold, the proposed method achieves better balance and improves the utilization of computing resources.
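The abstract does not give the threshold expression itself, so the following is only a minimal sketch of the general idea: derive the threshold from the nodes' measured loads instead of passing a fixed value (the stock balancer takes a fixed percentage via `hdfs balancer -threshold`). The weights, the weighted-sum load, the standard-deviation rule, and all function names below are assumptions made for illustration, not the paper's formula.

```python
from statistics import mean, pstdev

# Hypothetical weights for the five metrics named in the abstract.
# The paper's actual coefficients are not given here.
WEIGHTS = {"disk": 0.4, "cpu": 0.15, "mem": 0.15, "io": 0.15, "net": 0.15}


def node_load(metrics):
    """Weighted load of one DataNode; each metric is a fraction in [0, 1]."""
    return sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)


def dynamic_threshold(loads):
    """Assumed rule: use the population standard deviation of node loads."""
    return pstdev(loads)


def classify(nodes):
    """Return (overloaded, underloaded) node names, each sorted by name."""
    loads = {name: node_load(m) for name, m in nodes.items()}
    avg = mean(loads.values())
    thr = dynamic_threshold(list(loads.values()))
    over = sorted(n for n, load in loads.items() if load > avg + thr)
    under = sorted(n for n, load in loads.items() if load < avg - thr)
    return over, under


if __name__ == "__main__":
    # Made-up sample cluster, not data from the paper.
    sample = {
        "dn1": dict(disk=0.9, cpu=0.9, mem=0.9, io=0.9, net=0.9),
        "dn2": dict(disk=0.5, cpu=0.5, mem=0.5, io=0.5, net=0.5),
        "dn3": dict(disk=0.1, cpu=0.1, mem=0.1, io=0.1, net=0.1),
    }
    print(classify(sample))
```

In this sketch the threshold widens when node loads are spread out and narrows as the cluster approaches balance, which is one way to avoid a user-chosen constant; a real balancer would then move blocks from the overloaded nodes to the underloaded ones.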
Authors
WU Yaoyao (吴瑶瑶)
YANG Geng (杨庚)
College of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing 210023, China
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journals (北大核心)
2019, Issue 10, pp. 67-72, 224 (7 pages)
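Funding
National Natural Science Foundation of China (No. 61572263, No. 61502251, No. 61502243)
Natural Science Research Project of Jiangsu Higher Education Institutions (No. 14KJB520031)
China Postdoctoral Science Foundation (No. 2016M601859)
Natural Science Foundation of Jiangsu Province, General Program (No. BK20161516)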