
History-Based Fault Injection Testing for Cloud Platform (基于历史的云平台故障注入测试)

Cited by: 6
Abstract  Cloud computing is a model that provides convenient, on-demand, pay-as-you-go network access to a shared pool of configurable computing resources and improves their availability. In recent years, cloud platforms, which are service platforms built on cloud computing and composed of a foundation, a set of infrastructure services, and dedicated application services, have become the main platforms on which enterprises store data and deploy applications. Because cloud platforms are architecturally complex and offer diverse services, failures are difficult to avoid. To improve reliability, developers build fault-tolerance mechanisms into cloud platforms so that the platform keeps operating normally even when a failure occurs. These mechanisms do not guarantee complete reliability, however, so the reliability of the platform still has to be tested. Fault injection is one such testing method: faults are deliberately injected into a running system, and the system's behavior is observed to determine whether its fault-tolerance mechanisms work properly. Most existing fault injection methods analyze the characteristics of the system under test to determine fault injection points, which makes them white-box or gray-box testing; for a complex cloud platform, this analysis takes a great deal of time. We therefore propose a black-box testing method that does not depend on system information, in order to improve testing efficiency. Building on existing work, this paper makes four contributions. First, we collected historical outage reports for cloud platforms and analyzed the patterns in which failure modes appear in them. We found that the types of failures occurring in cloud platforms are repetitive, and on that basis we extracted the characteristics of these failures, including the affected component, root cause, impact, and repair method. Second, our analysis of the outage reports showed that in many incidents failures do not occur alone, and that multiple failures are correlated and combined. We analyzed in depth the relationships among multiple failures and the forms in which they combine, and we propose that, to test the reliability of a cloud platform as completely as possible, multiple faults should be combined and injected together during fault injection. Third, we found that, because of the complexity of cloud platforms and the diversity of fault types, combining multiple faults leads to a combinatorial explosion of the combination space. We carried out a preliminary investigation of this problem and propose several reduction strategies. Fourth, based on the above work, we propose a history-based fault combination method and conduct simulation experiments using historical fault data together with a basic cloud platform architecture. The experimental results show that the proposed history-based fault combination injection method is effective and feasible.
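The combinatorial explosion mentioned in the abstract, and the general idea of using history to shrink the combination space, can be illustrated with a small sketch. This is not the authors' method or data: the fault names, the incident sets, and the filtering rule (keep a combination only if all its faults appeared together in at least one past incident) are assumptions made purely for illustration.

```python
from itertools import combinations
from math import comb

# Hypothetical fault types, for illustration only (not from the paper's data set).
FAULT_TYPES = ["network_partition", "disk_full", "node_crash",
               "config_error", "power_loss"]

# Hypothetical historical incidents: each set lists faults reported together.
HISTORY = [
    {"network_partition", "node_crash"},
    {"disk_full", "config_error"},
    {"network_partition", "node_crash", "power_loss"},
]


def all_combinations(faults, max_size):
    """Enumerate every fault combination of size 1..max_size."""
    for k in range(1, max_size + 1):
        yield from combinations(sorted(faults), k)


def history_filtered(faults, max_size, history):
    """Keep only combinations whose faults all co-occurred in some past incident.

    This is one assumed reduction rule, not the strategy from the paper.
    """
    return [c for c in all_combinations(faults, max_size)
            if any(set(c) <= incident for incident in history)]


# Size of the unreduced space: sum of C(n, k) for k = 1..3.
full_count = sum(comb(len(FAULT_TYPES), k) for k in range(1, 4))
reduced = history_filtered(FAULT_TYPES, 3, HISTORY)

print(f"all combinations up to size 3: {full_count}")
print(f"combinations seen together in history: {len(reduced)}")
```

With only five fault types the unreduced space is still small, but it grows as the sum of binomial coefficients C(n, k), so a realistic catalogue of fault types quickly becomes impractical to inject exhaustively. That growth is the motivation for reduction strategies of some kind.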
Authors  MA Hua (马骅); NIE Chang-Hai (聂长海); WU Hua-Yao (吴化尧) (State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210046)
Source  Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2019, No. 10, pp. 2281-2296 (16 pages)
Funding  Supported by the National Key Research and Development Program of China (2018YFB1003800)
Keywords  cloud platform; failure mode; historical faults; fault injection