Abstract
To address the NameNode memory bottleneck that arises when HDFS (Hadoop Distributed File System) stores massive numbers of small files, and to improve the efficiency with which HDFS handles them, this paper proposes a storage and access optimization scheme based on small-file merging and prefetching. First, correlations among small files are derived by analyzing a large volume of historical access logs; correlated small files are then merged into large files before being stored on HDFS. When a client reads data from HDFS, the files the user is most likely to access next are prefetched according to these correlations, which reduces the number of client requests to the NameNode and improves the hit rate and processing speed. Experimental results show that this method effectively improves the efficiency of storing and accessing massive small files on HDFS and lowers the memory utilization of the NameNode.
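The scheme described above rests on two steps: mining file correlations from access logs, then using them both to form merge groups and to pick prefetch candidates. The following is a minimal illustrative sketch of that idea, not the paper's actual implementation; the session-based log format, the co-access threshold, and all function names are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def pair_counts(sessions):
    """Count how often each pair of files is accessed in the same session."""
    counts = defaultdict(int)
    for files in sessions:
        for a, b in combinations(sorted(set(files)), 2):
            counts[(a, b)] += 1
    return counts

def merge_groups(counts, threshold=2):
    """Union-find grouping: files co-accessed >= threshold times land in
    the same group, to be merged into one large file on HDFS."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (a, b), n in counts.items():
        if n >= threshold:
            parent[find(a)] = find(b)
    groups = defaultdict(list)
    for f in parent:
        groups[find(f)].append(f)
    return [sorted(g) for g in groups.values()]

def prefetch_candidates(filename, counts, k=2):
    """Files most often co-accessed with `filename`, best first —
    candidates to prefetch when the client reads `filename`."""
    related = []
    for (a, b), n in counts.items():
        if a == filename:
            related.append((n, b))
        elif b == filename:
            related.append((n, a))
    return [f for n, f in sorted(related, reverse=True)[:k]]

# Toy access log: each inner list is one user session.
sessions = [["a.jpg", "b.jpg"], ["a.jpg", "b.jpg", "c.jpg"],
            ["a.jpg", "b.jpg"], ["x.txt", "y.txt"]]
counts = pair_counts(sessions)
print(merge_groups(counts))                  # [['a.jpg', 'b.jpg']]
print(prefetch_candidates("a.jpg", counts))  # ['b.jpg', 'c.jpg']
```

In a real deployment the merge step would also record each small file's offset and length inside the merged file in an index, so reads can be served without consulting the NameNode per small file.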
Source
Application Research of Computers (《计算机应用研究》)
CSCD
Peking University Core Journal (北大核心)
2017, Issue 8, pp. 2319-2323 (5 pages)
Funding
National Natural Science Foundation of China (11271057, 61640211)
Jiangsu Province Postgraduate Research Innovation Program (SCZ1412800004)
Keywords
massive small files
relationship between files
merge
prefetch