期刊文献+

一种验证分布式协议活性属性容错机制的模型检测方法 被引量:2

A Model Checking Method for Verifying the Fault Tolerance of Distributed Protocol Liveness Properties
下载PDF
导出
摘要 云计算是一种通过网络以服务的方式向用户提供按需收费的计算资源的模式,目前企业逐渐将业务部署、数据处理转移到云计算平台上进行。因为可扩展性、性能等各方面需求,所以云平台部署在分布式系统上。由于分布式系统采用大量的商品机通过复杂的结构进行搭建,因此分布式系统中组件发生故障是无法避免的。为了提高分布式系统的可靠性,技术人员在开发分布式系统时为其设计了容错机制。为了保证容错机制在分布式系统发生故障时能真正有效地工作,故障注入是检验容错机制的方法之一,通过人为地向系统中注入特定的故障,观察系统的行为并检验容错机制是否正确工作。由于分布式系统的并发特性,传统软件测试方法无法对其进行完全测试,近年来越来越多地使用模型检测技术来对分布式系统进行验证。现有的模型检测技术注重对分布式系统的安全性属性和活性属性的检测,忽略了对容错机制尤其是活性属性容错机制的检测,所以如何验证系统的活性属性容错机制是目前面临的挑战。采用抽象模型检测方法会引入模型与实际系统不匹配的问题。同时,采用实现级模型检测方法会加剧模型检测中的状态空间爆炸问题。本文提出了一个实现级模型检测工具LTMC(Liveness Properties Fault Tolerance Model Checker),结合故障注入技术对分布式协议的安全性属性与活性属性及其容错机制进行验证。同时,基于分布式系统节点的角色,本文提出了一种对等约减策略PRP(Peer Reduction Policy)对LTMC需要搜索的状态空间进行约减,缓解了状态空间爆炸问题。此外,LTMC通过引入逻辑时钟机制,优先搜索那些更有实际价值的事件执行路径。LTMC能够有目标地在待验证系统运行的特定时刻注入特定的故障,而不依赖于随机故障注入策略;当待验证系统发生改变时,只需要简单地对工具进行轻微的修改;LTMC可以系统地发现分布式协议中指定类型的所有Bug。在本文最后,我们将LTMC应用到ZooKeeper和Cassandra的几个协议中,并与深度优先搜索作对比,可以发现LTMC有3.7~594.4倍的状态空间约减率。 Cloud computing is a new mode that provides users with computing resources charged on demand through a communication network as a service.At present,enterprises are gradually transferring business deployment and data processing to cloud computing platforms.Because of various requirements such as scalability and performance,cloud platforms are deployed on distributed systems.Since the distributed system uses a large number of commodity machines to build through a complex structure,technicians cannot guarantee that these machines will work correctly all the time.Therefore,component failures in the distributed system are unavoidable.In order to improve the reliability of the distributed system,the technicians designed fault-tolerant mechanisms for them when developing the distributed system.In order to ensure that the faulttolerant mechanisms can really work correctly when there is a failure in the distributed system,fault injection is one of the most effective ways to test the fault-tolerant mechanism,observing the behavior of the system and verifying whether the fault-tolerance mechanism is working correctly,by artificially injecting specific faults into the system under test.Due to the concurrent nature of distributed systems,it is very difficult for traditional software testing methods to fully test such systems.In recent years,technicians have increasingly used more systematic model checking techniques to verify distributed systems.However,existing model checking technology focuses on the checking of the safety properties and liveness properties of distributed systems,and ignores the checking of fault-tolerant mechanisms,especially liveness properties fault-tolerant mechanisms.As a result,how to verify the liveness properties fault-tolerant mechanism of the distributed system is currently a challenge.The use of abstract model checking method will introduce the problem of mismatch between the model and the actual system because of human error.At the same time,the use of implementation-level model checking method will aggravate the state space explosion problem.As a result,in this paper we propose an implementation-level model checking tool LTMC(Liveness Properties Fault Tolerance Model Checker),which combines fault injection technology to verify the liveness properties with their fault tolerance mechanisms and safety properties of the distributed protocols in the distributed systems.At the same time,based on the role of distributed system nodes,this paper proposes a PRP strategy(Peer Reduction Policy)to reduce the state space that LTMC needs to search,alleviating the problem of space explosion.LTMC can purposefully inject specific faults at specific moments when the system to be verified is running,instead of relying on random fault injection strategies;when the system to be verified changes,only slight modifications to the model checker are required;LTMC can systematically discover all bugs of the specified types in the distributed protocols.In addition,LTMC prioritizes the exploration of more practical execution paths of events by introducing a logic clock mechanism.At the end of this article,we apply LTMC to several protocols of ZooKeeper and Cassandra which are the most popular distributed systems on the market,and compare LTMC with depth-first exploration.LTMC has a state space reduction rate of 3.7-594.4 times.
作者 陆超逸 聂长海 张成志 LU Chao-Yi;NIE Chang-Hai;ZHENG Cheng-zhi(State Key Laboratory for Novel Software Technology,Nanjing University,Nanjing 210023;The Hong Kong University of Science and Technology,Hong Kong 999077)
出处 《计算机学报》 EI CAS CSCD 北大核心 2021年第8期1714-1731,共18页 Chinese Journal of Computers
基金 国家重点研发计划(2018YFB1003800)资助.
关键词 分布式系统 模型检测 故障注入 活性属性 容错机制 对等约减策略 distributed system model checking fault injection liveness properties fault-tolerance mechanisms peer reduction policy
  • 相关文献

同被引文献17

引证文献2

二级引证文献1

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部