Adaptive Scalable RPC Timeout Mechanism for Large Scale Clusters

Cited by: 2
Abstract: In distributed systems built on RPC (remote procedure call), timeouts are a common means of failure detection and are typically reported on a per-call basis. Stress testing on a very large Lustre storage cluster showed that the traditional fixed timeout mechanism triggers many unnecessary timeouts, especially when the server is heavily loaded. This paper proposes an Adaptive Scalable RPC Timeout (AST) mechanism that takes network conditions, server load, scalability, and performance into account. Under AST, the client's timeout value is adjusted dynamically according to network and server congestion, and the server can send an additional message notifying the client to modify the original timeout value of an RPC. A series of simulations and validation experiments indicates that AST is a more suitable failure detection model for RPC: it improves system responsiveness, reliability, and stability without a significant negative impact on performance, even on large-scale clusters.
Source: Journal of Software (《软件学报》), 2010, No. 12, pp. 3199-3210 (12 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China, No. 60736013
Keywords: RPC (remote procedure call); failure detection; timeout; large scale; scalability; responsiveness; reliability
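
The abstract describes two behaviours: a client-side timeout that adapts to network and server congestion, and a server-side notification that lets the client revise the timeout of an RPC already in flight. The abstract does not give the actual AST formula, so the following is only a minimal sketch under stated assumptions: a sliding-window maximum of recent round-trip samples padded by a safety margin and clamped to a range. All class names, parameters, and default values here are hypothetical and are not taken from the paper.

```python
from collections import deque


class AdaptiveTimeout:
    """Hypothetical client-side timeout estimator (not the paper's formula)."""

    def __init__(self, min_timeout=1.0, max_timeout=600.0, window=8, margin=1.25):
        self.min_timeout = min_timeout
        self.max_timeout = max_timeout
        self.margin = margin
        # Recent (network latency + server service time) samples, in seconds.
        self.samples = deque(maxlen=window)

    def record(self, network_latency, service_time):
        """Store the observed cost of one completed RPC."""
        self.samples.append(network_latency + service_time)

    def current(self):
        """Timeout for the next RPC: padded worst recent sample, clamped."""
        if not self.samples:
            return self.max_timeout  # no history yet, so stay conservative
        estimate = max(self.samples) * self.margin
        return min(self.max_timeout, max(self.min_timeout, estimate))


class PendingRpc:
    """Deadline of one in-flight RPC, extendable by a server notification."""

    def __init__(self, sent_at, timeout):
        self.deadline = sent_at + timeout

    def on_server_extension(self, now, extra_seconds):
        """Server reports it is busy but alive: push the deadline out."""
        self.deadline = max(self.deadline, now + extra_seconds)

    def expired(self, now):
        return now >= self.deadline


# Usage: two completed RPCs, the second one served under heavy load.
at = AdaptiveTimeout()
at.record(network_latency=0.01, service_time=0.2)
at.record(network_latency=0.02, service_time=3.0)

# A new RPC sent at t=100 s uses the adapted timeout instead of a fixed one.
rpc = PendingRpc(sent_at=100.0, timeout=at.current())

# At t=103 s the server asks for 10 more seconds instead of letting it expire.
rpc.on_server_extension(now=103.0, extra_seconds=10.0)
```

Taking the window maximum rather than the mean is a deliberately pessimistic choice in this sketch: it trades slower failure detection for fewer spurious timeouts, which is the problem the abstract attributes to fixed timeouts under server load.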
