Implementation Method of Low Overhead Rollback Recovery Based on Concurrency Exploiting (基于并发性发掘的低开销回卷恢复实现方法)

Abstract: Existing rollback-recovery fault-tolerance techniques suffer from synchronization constraints and blocking, so their time overhead grows sharply as the number of system nodes increases. To address this, the paper proposes a low-overhead rollback-recovery implementation method based on concurrency exploiting. Dependency-tracking information is piggybacked on the messages that processes exchange, which removes the synchronization constraints of message logging. The workloads within a process are analyzed to expose their concurrency, an architecture for executing them concurrently is built, and a data-buffering strategy together with multithreading is used to run the workloads inside a process concurrently, reducing the overhead of fault recovery. Experimental results on three NAS NPB2.3 benchmarks show that checkpoint overhead falls from 0.63 s, 3.19 s, and 1.21 s to 0.18 s, 0.67 s, and 0.19 s respectively, and that the message-logging overhead ratio falls from 13.4%, 3.5%, and 18.3% to 0.7%, 0.1%, and 1.0% respectively.
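The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of combining a data buffer with a helper thread so that checkpointing does not block the computation. It is not the authors' implementation. The function name `run_with_background_checkpoints`, the toy accumulator state, and the in-memory `store` dictionary (standing in for stable storage) are all assumptions made for illustration.

```python
import queue
import threading


def run_with_background_checkpoints(steps, checkpoint_every, store):
    """Run a toy computation for `steps` iterations.

    Every `checkpoint_every` iterations the state is copied into a buffer;
    a helper thread drains the buffer and persists the snapshots, so the
    compute loop does not wait for the (normally slow) write to finish.
    """
    buffer = queue.Queue()  # data buffer between compute loop and writer

    def writer():
        while True:
            item = buffer.get()
            if item is None:  # sentinel: no more snapshots will arrive
                break
            step, snapshot = item
            # Hypothetical stand-in for writing to stable storage.
            store[step] = snapshot

    helper = threading.Thread(target=writer)
    helper.start()

    state = {"step": 0, "acc": 0}
    for i in range(1, steps + 1):
        state["acc"] += i  # the "application" workload
        state["step"] = i
        if i % checkpoint_every == 0:
            # Copy the state into the buffer and continue immediately.
            buffer.put((i, dict(state)))

    buffer.put(None)  # tell the writer to finish
    helper.join()     # wait until all snapshots are persisted
    return state


store = {}
final_state = run_with_background_checkpoints(10, 5, store)
```

In this sketch the only work the compute loop does at a checkpoint is copying its state into the queue. The copy matters: handing the writer a reference to the live `state` dictionary would let later iterations change a snapshot before it is saved.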

Authors: Yuan Gongbiao (袁功彪), Yang Jinmin (杨金民), Bai Shuren (白树仁)

Affiliations: School of Information Science and Engineering, Hunan University; Supercomputing Center, Hunan University

Source: Computer Engineering (《计算机工程》), indexed in CAS and CSCD, 2013, No. 11, pp. 46-51 (6 pages)

Funding: National Natural Science Foundation of China (61272401, 61133005); Key Project of the Hunan Provincial Science and Technology Plan (201GK2003)

Keywords: rollback recovery; time overhead; synchronization constraint; concurrency exploiting; message log; checkpoint

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
