JIACKPT: A Recoverable Software Distributed Shared Memory System (可恢复的软件DSM系统JIACKPT)
Abstract: Software distributed shared memory (DSM) systems build a shared-memory programming environment on top of clusters, combining the programmability of shared memory with the scalability of clusters, and have therefore been widely studied. Because a software DSM system is a distributed system, its risk of failure is high, and fault tolerance is needed to make it practical. Using user-level checkpointing, a recoverable and highly portable software DSM system, JIACKPT (JIAjia with ChecKPoinTing), has been designed and implemented on top of JIAJIA, a software DSM system that supports the scope consistency memory model. By maintaining a strict global consistent state suited to software DSM systems and applying several optimizations, JIACKPT is easy to implement and achieves good performance. Application tests on an 8-node PC cluster show that, even when a checkpoint is taken once per minute, the checkpoint overhead of most applications is below 10% of the total execution time. JIACKPT is also highly portable and runs on several operating systems, such as Linux and Solaris. These results indicate that JIACKPT is a practical recoverable software DSM system.
Source: Journal of Software (《软件学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2005, No. 2, pp. 165-173 (9 pages)
Funding: National Natural Science Foundation of China
Keywords: software DSM system; checkpoint; global consistent state; JIAJIA; computer software; computer workstations; data communication systems; fault tolerant computer systems