Cache-style Parallel Checkpointing for Large-scale Computing System (面向大规模计算系统的Cache式并行检查点)

Cited by: 1

Abstract: Checkpointing is an important fault-tolerance mechanism in high-performance parallel computing systems, but as systems grow, the scalability of parallel checkpointing is limited by the cost of file access. Targeting the multi-level file system architecture of large-scale parallel computing systems, this paper proposes cache-style parallel checkpointing. The technique turns a globally coordinated parallel checkpoint into local file operations, and it uses the multi-processor structure to schedule write-back as an out-of-order pipeline so that checkpoint flushes are spread over time. The write-back overhead is thereby hidden effectively, which keeps checkpoint file access fast and scalable.
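The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `CachedCheckpointer`, its methods, the file naming scheme, and the use of a thread pool to stand in for spare processor cores are all assumptions made for illustration. The blocking part of a checkpoint is only a local file write, and the copy to the shared (global) tier is scheduled in the background, so flushes may complete in any order.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class CachedCheckpointer:
    """Hypothetical sketch: checkpoint to a local tier, flush to a global tier later."""

    def __init__(self, local_dir, global_dir, flush_workers=2):
        self.local_dir = Path(local_dir)
        self.global_dir = Path(global_dir)
        # Assumption: worker threads stand in for spare cores doing write-back.
        self.pool = ThreadPoolExecutor(max_workers=flush_workers)
        self.pending = []

    def checkpoint(self, rank, step, state: bytes):
        """Blocking part of a checkpoint: a purely local file write."""
        local_path = self.local_dir / f"rank{rank}_step{step}.ckpt"
        local_path.write_bytes(state)
        # Non-blocking part: queue the write-back; completion order is not fixed.
        self.pending.append(self.pool.submit(self._flush, local_path))
        return local_path

    def _flush(self, local_path):
        """Copy one cached checkpoint to the global tier."""
        target = self.global_dir / local_path.name
        tmp = target.with_suffix(".tmp")
        shutil.copyfile(local_path, tmp)
        tmp.replace(target)  # rename last so readers never see a partial file
        return target

    def drain(self):
        """Wait for all outstanding write-backs and return the flushed paths."""
        done = [future.result() for future in self.pending]
        self.pending.clear()
        return done

    def close(self):
        self.drain()
        self.pool.shutdown()


def demo():
    """Checkpoint four simulated ranks and return what reached the global tier."""
    with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as shared:
        ckpt = CachedCheckpointer(local, shared)
        for rank in range(4):
            ckpt.checkpoint(rank, step=1, state=f"state-of-rank-{rank}".encode())
        flushed = ckpt.drain()
        contents = {p.name: p.read_bytes() for p in flushed}
        ckpt.close()
        return contents
```

In this sketch the application resumes as soon as `checkpoint` returns, and `drain` is only needed before the local copies are discarded. A real system would also have to decide when a cached checkpoint may be evicted and how to recover if a node fails before its flush completes; the abstract does not describe those details, so they are left out here.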

Authors: LIU Yongyan (刘勇燕), LIU Yongpeng (刘勇鹏), FENG Hua (冯华), CHI Wanqing (迟万庆)

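Affiliations: Information Center, Ministry of Science and Technology; School of Computer Science, National University of Defense Technology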

Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2011, No. 5, pp. 287-289, F0003 (4 pages)

Funding: Supported by the Open Fund of the State Key Laboratory of High-end Server and Storage Technology (2009HSSA04)

Keywords: cache-style checkpointing; parallel computing; multi-level file system; multi-processor; out-of-order pipeline

Classification: TP338.4 [Automation and Computer Technology: Computer System Architecture]
