Abstract
In distributed systems built on RPC (remote procedure call), timeouts are a common means of failure detection and are typically reported on a per-call basis. Stress testing on a very large Lustre storage cluster showed that the traditional fixed timeout mechanism produces many unnecessary timeouts, especially when the server is heavily loaded. This paper proposes an Adaptive Scalable RPC Timeout (AST) mechanism that takes network conditions, server load, scalability, and performance into account. Under AST, the timeout value set by a client is adjusted dynamically according to network and server congestion, and the server can send an additional message notifying the client to modify the original timeout value of an RPC. A series of simulations and validation experiments indicates that AST is a more suitable failure detection model for RPC: it improves system responsiveness, reliability, and stability without a significant negative impact on performance, even on large-scale clusters.
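The two ideas named in the abstract (a client-side timeout that follows observed congestion, and a server-side notification that lengthens it) can be illustrated with a minimal sketch. This is not the AST algorithm from the paper, whose details the abstract does not give. The class name `AdaptiveTimeout`, its methods, the sliding-window maximum, the 1.5 safety margin, and all default values are assumptions chosen only for illustration.

```python
from collections import deque


class AdaptiveTimeout:
    """Hypothetical client-side timeout estimator (illustration only)."""

    def __init__(self, initial=10.0, floor=1.0, ceiling=600.0, window=8):
        self.initial = initial    # timeout used before any observation (s)
        self.floor = floor        # never wait less than this (s)
        self.ceiling = ceiling    # never wait more than this (s)
        self.service = deque(maxlen=window)  # recent server service times (s)
        self.network = deque(maxlen=window)  # recent network latencies (s)

    def observe(self, service_time, network_latency):
        """Record the measurements from one completed RPC."""
        self.service.append(service_time)
        self.network.append(network_latency)

    def timeout(self):
        """Timeout for the next RPC: worst recent case plus a 50% margin."""
        if not self.service:
            return self.initial
        estimate = (max(self.service) + max(self.network)) * 1.5
        return min(self.ceiling, max(self.floor, estimate))

    def on_server_hint(self, requested):
        """Server asks for more time on an in-flight RPC; clamp and apply."""
        return max(self.timeout(), min(requested, self.ceiling))


at = AdaptiveTimeout()
at.observe(service_time=4.0, network_latency=0.5)
print(at.timeout())             # (4.0 + 0.5) * 1.5
print(at.on_server_hint(30.0))  # a loaded server requests a longer wait
```

A fixed timeout corresponds to always returning `initial`. In this sketch a busy server raises the timeout in two ways: implicitly through longer observed service times, and explicitly through `on_server_hint`.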
Source
Journal of Software (《软件学报》)
EI
CSCD
PKU Core (北大核心)
2010, No. 12, pp. 3199-3210 (12 pages)
Funding
National Natural Science Foundation of China (No. 60736013)
Keywords
RPC (remote procedure call)
failure detection
timeout
large scale
scalability
responsiveness
reliability