Design and Implementation of a Cluster Monitor System Based on Asynchronous Clock
Abstract: A cluster monitoring system is software that manages a cluster and makes it easier for users to work with. Existing cluster management systems fall short in timeliness and robustness. To address this, the paper proposes a server monitoring technique based on asynchronous clocks: a specified set of servers is organized into a ring queue with a feedback mechanism, so that users can manage the cluster as a single entity. The system can add and remove server nodes transparently and reconfigures itself automatically to achieve high availability. A thread pool and I/O multiplexing are used to shorten the monitoring system's response time. Test results show that the system takes appropriate action in response to transmission delay, packet loss rate, system load, and server or network failures, and that it detects and clears faults within a short time. It is well suited to transaction-oriented cluster systems such as mail servers and web servers.
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), CSCD, Peking University Core Journal, 2005, No. 9, pp. 1617-1620 (4 pages)
Funding: Supported by the National "863" Key Project (2001AA11110 2004AA111120).
Keywords: cluster; asynchronous clock; membership protocol; high availability; remote procedure call
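
The abstract describes the monitored servers as a ring queue to which nodes can be added and removed transparently, with failures detected from delays rather than from synchronized clocks. It gives no implementation details, so the following is only a minimal sketch of that idea under stated assumptions: the class name `RingMonitor`, its methods, the 3-second timeout, and the node names are all hypothetical, and the paper's actual membership protocol, feedback mechanism, thread pool, and I/O multiplexing are not reproduced here. The sketch measures silence only on the monitor's own local clock, so it never compares timestamps taken on different machines.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RingMonitor:
    """Hypothetical ring of monitored nodes; not the paper's implementation."""
    timeout: float = 3.0                           # assumed seconds of silence before eviction
    ring: list = field(default_factory=list)       # node names in ring order
    last_seen: dict = field(default_factory=dict)  # node name -> local time of last heartbeat

    def join(self, node, now=None):
        """Add a node at the end of the ring and record its first heartbeat."""
        if node not in self.last_seen:
            self.ring.append(node)
        self.last_seen[node] = time.monotonic() if now is None else now

    def leave(self, node):
        """Remove a node; neighbours are derived from ring order, so nothing else changes."""
        if node in self.last_seen:
            self.ring.remove(node)
            del self.last_seen[node]

    def heartbeat(self, node, now=None):
        """Record a heartbeat using the monitor's local clock only."""
        if node in self.last_seen:
            self.last_seen[node] = time.monotonic() if now is None else now

    def successor(self, node):
        """Return the next node in the ring, wrapping around at the end."""
        i = self.ring.index(node)
        return self.ring[(i + 1) % len(self.ring)]

    def sweep(self, now=None):
        """Evict every node that has been silent longer than the timeout."""
        now = time.monotonic() if now is None else now
        failed = [n for n in self.ring if now - self.last_seen[n] > self.timeout]
        for n in failed:
            self.leave(n)
        return failed


# Usage with explicit timestamps so the run is deterministic.
monitor = RingMonitor(timeout=3.0)
for name in ("mail-1", "mail-2", "web-1"):
    monitor.join(name, now=0.0)
monitor.heartbeat("mail-1", now=2.0)
monitor.heartbeat("web-1", now=2.5)
failed = monitor.sweep(now=4.0)  # mail-2 has been silent for 4.0 s, above the 3.0 s timeout
```

After the sweep, the ring closes over the gap left by the evicted node without any explicit reconfiguration step, which is the property the abstract calls automatic reconfiguration. A real monitor would also need to weigh packet loss and load before declaring a node failed, as the abstract notes.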