
Key Technology in Distributed File System Towards Big Data Analysis (cited by: 72)
Abstract: With the arrival of the big data era, data analysis and processing have become technologies on which data centers and Internet companies increasingly rely. As information volumes grow and data structures diversify, mass data storage has become a research focus in big data analysis. Traditional distributed file systems struggle to meet the new demands for scalability, reliability, and data access performance. This paper designs and implements Clover, a distributed file system for big data analysis built for large-scale clusters. Clover manages its namespace through directory-based partitioning and consistent hashing, which addresses metadata scalability. A modified two-phase commit protocol keeps distributed metadata operations consistent across multiple metadata servers. Clover also proposes a high-availability mechanism based on a shared storage pool, improving metadata reliability through hot standby and global state recovery. The evaluation shows that Clover's metadata processing capacity grows linearly with the number of servers: each added metadata server improves metadata operation performance by 5.13% to 159.32% on average. Because of the overhead of namespace management and distributed transactions, running multiple metadata servers degrades the performance of complex operations, but the degradation is small (less than 10%). Compared with HDFS, Clover delivers similar file read and write bandwidth and recovers quickly after a metadata server failure. Practical application tests show that Clover is suitable for building highly scalable and highly available storage systems.
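Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2014, No. 2, pp. 382-394 (13 pages)
Funding: National High Technology Research and Development Program of China (863 Program) (2013AA013204); Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030200); National Natural Science Foundation of China (60903047); National Key Technology R&D Program of China (2012BAH46B03)
Keywords: big data; mass data storage; distributed file system; metadata scalability; high availability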
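
The abstract names directory-based partitioning combined with consistent hashing as the namespace management method but does not spell out the scheme. As a rough illustration of the general idea only, and not Clover's actual design, the sketch below hashes each file's parent directory onto a ring of metadata servers, so all entries of one directory land on the same server and adding a server relocates only part of the directories. The class name `DirectoryRing`, the method `server_for`, the server names `mds0` to `mds3`, the virtual-node count, and the use of MD5 are all assumptions made for this example.

```python
import bisect
import hashlib
import posixpath


def _hash(key: str) -> int:
    """Map a string to a point on a 64-bit ring (MD5 is an arbitrary choice)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class DirectoryRing:
    """Hypothetical consistent-hash ring that assigns directories to metadata servers."""

    def __init__(self, servers, vnodes=64):
        self.vnodes = vnodes   # virtual nodes per server, smooths the distribution
        self._points = []      # sorted hash points on the ring
        self._owner = {}       # hash point -> server name
        for server in servers:
            self.add_server(server)

    def add_server(self, server):
        """Place `vnodes` points for the server on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def server_for(self, path):
        """Return the server responsible for the parent directory of `path`."""
        parent = posixpath.dirname(posixpath.normpath(path)) or "/"
        # The first ring point clockwise from the directory's hash owns it.
        idx = bisect.bisect_right(self._points, _hash(parent)) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    ring = DirectoryRing(["mds0", "mds1", "mds2"])
    # Files in the same directory resolve to the same metadata server.
    print(ring.server_for("/data/logs/a.txt") == ring.server_for("/data/logs/b.txt"))
    ring.add_server("mds3")
    print(ring.server_for("/data/logs/a.txt"))
```

Hashing the parent directory rather than the full path keeps single-directory operations such as listing or creating a file on one server; operations that span directories on different servers would still need a distributed protocol such as the modified two-phase commit the abstract mentions.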


