Adaptive Scalable RPC Timeout Mechanism for Large Scale Clusters
(Original title: 大规模集群中一种自适应可扩展的RPC超时机制; cited by: 2)

Abstract: In distributed systems built on RPC (remote procedure call), timeouts are a common means of failure detection and are typically reported on a per-call basis. During stress testing of a very large-scale Lustre storage cluster, the traditional fixed timeout mechanism was found to cause many unnecessary timeouts, especially when the server is heavily loaded. This paper proposes an Adaptive Scalable RPC Timeout mechanism (AST for short) that takes network conditions, server load, scalability, and performance into account. Under AST, the timeout value set by the client is adjusted dynamically according to the congestion of the network and the server. In addition, the server can send an extra message notifying the client to modify the original timeout value of an RPC. A series of simulations and validation experiments indicates that AST is a more suitable failure detection model for RPC: it improves system responsiveness, reliability, and stability without a significant negative impact on performance, even in large-scale cluster systems.
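The two ideas named in the abstract (a client timeout that tracks recent congestion, and a server notification that extends the deadline of an in-flight RPC) can be illustrated with a minimal sketch. This is not the algorithm from the paper, which the abstract does not spell out. The class names `AdaptiveTimeout` and `PendingRpc`, the sliding-window maximum, the safety margin, and all default values are assumptions chosen only for illustration.

```python
from collections import deque


class AdaptiveTimeout:
    """Hypothetical client-side estimator; NOT the paper's actual algorithm."""

    def __init__(self, min_timeout=1.0, max_timeout=600.0, window=8, margin=1.25):
        self.min_timeout = min_timeout
        self.max_timeout = max_timeout
        self.margin = margin  # assumed safety factor on top of the estimate
        # Only the most recent `window` samples are kept, so the estimate
        # shrinks again once congestion has passed.
        self.service_times = deque(maxlen=window)
        self.network_latencies = deque(maxlen=window)

    def record(self, service_time, network_latency):
        """Store measurements taken from a completed RPC (in seconds)."""
        self.service_times.append(service_time)
        self.network_latencies.append(network_latency)

    def current_timeout(self):
        """Timeout for the next RPC: recent worst case times the margin, clamped."""
        if not self.service_times:
            return self.min_timeout
        estimate = max(self.service_times) + max(self.network_latencies)
        return min(self.max_timeout, max(self.min_timeout, estimate * self.margin))


class PendingRpc:
    """Tracks the deadline of one in-flight RPC."""

    def __init__(self, sent_at, timeout):
        self.deadline = sent_at + timeout

    def extend(self, now, extra):
        """Apply a server notification asking the client to wait `extra` more seconds."""
        self.deadline = max(self.deadline, now + extra)

    def expired(self, now):
        return now >= self.deadline


# Example: one slow reply raises the timeout, then the server asks for more time.
est = AdaptiveTimeout()
est.record(service_time=4.0, network_latency=0.5)
rpc = PendingRpc(sent_at=0.0, timeout=est.current_timeout())
rpc.extend(now=5.0, extra=10.0)
```

Taking the window maximum rather than an average is a deliberately conservative choice in this sketch: it trades slower failure detection for fewer spurious timeouts under load, which is the failure mode of fixed timeouts that the abstract describes.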

Authors: Qian Yingjin (钱迎进), Xiao Nong (肖侬), Jin Shiyao (金士尧)

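Affiliation: National Key Laboratory for Parallel and Distributed Processing, School of Computer Science, National University of Defense Technology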

Source: Journal of Software (《软件学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2010, Issue 12, pp. 3199-3210 (12 pages)

Funding: National Natural Science Foundation of China, No. 60736013

Keywords: RPC (remote procedure call); failure detection; timeout; large scale; scalability; responsiveness; reliability

Classification: TP316 [Automation and Computer Technology: Computer Software and Theory]
