Abstract
The Hadoop Distributed File System (HDFS) performs well for big data storage and is well suited to processing and storing large files, but its performance degrades significantly when handling massive numbers of small files, because too many small files consume an excessive amount of system memory. To improve the efficiency with which HDFS handles small files, this paper improves the HDFS storage scheme and proposes a storage optimization scheme for massive small files. Small files are first classified according to their correlation; files in the same class are then merged and uploaded as a single large file, and an index file is generated. A client-side caching mechanism is used during reads to improve access efficiency. Experimental results show that, under rapid data growth, the proposed scheme effectively improves small-file access efficiency, reduces system memory overhead, and improves the performance of HDFS when processing massive small files.
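The core of the scheme described above is file merging plus an index: each small file is appended to one large file, and the index records where each small file lives so a read can seek directly to it, with a client-side cache for repeated reads. The paper's actual implementation targets HDFS; the following is only a minimal in-memory Python sketch of the merge/index/cache idea, with all names (`merge_small_files`, `CachedReader`) being hypothetical illustrations rather than the authors' API.

```python
import io


def merge_small_files(files):
    """Merge small files into one blob and build an index of (offset, length).

    `files` is a dict mapping file name -> bytes, standing in for a set of
    correlated small files that the scheme would upload as one large file.
    """
    index = {}
    buf = io.BytesIO()
    for name, data in files.items():
        index[name] = (buf.tell(), len(data))  # record where this file starts
        buf.write(data)
    return buf.getvalue(), index


class CachedReader:
    """Read individual small files out of the merged blob via the index,
    keeping recently read entries in a simple client-side cache."""

    def __init__(self, blob, index, cache_size=128):
        self.blob = blob
        self.index = index
        self.cache = {}          # name -> bytes; a real client would evict (e.g. LRU)
        self.cache_size = cache_size

    def read(self, name):
        if name in self.cache:   # cache hit: no seek into the merged file needed
            return self.cache[name]
        offset, length = self.index[name]
        data = self.blob[offset:offset + length]
        if len(self.cache) < self.cache_size:
            self.cache[name] = data
        return data
```

In the real system the merged file and the index file would be stored in HDFS and reads would go through the HDFS client API; this sketch only illustrates why a single merged file with an index reduces per-file metadata, since the NameNode then tracks one large file instead of many small ones.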
Source
《计算技术与自动化》 (Computing Technology and Automation)
2017, No. 3, pp. 134-138 (5 pages)
Funding
Project of the Shaanxi Key Laboratory of Network Computing and Security Technology (15JS078)
Project of the Xi'an Science and Technology Program (CXY1518(1))