摘要
通过分析Google集群中任务的失效次数和失效模式,找到具有高失效频次和连续失效特征的杀手级任务。杀手级任务不仅影响云计算系统上应用运行的可靠性与可用性,而且会浪费大量资源并显著增加调度负载。在杀手级任务资源使用模式的基础上,提出一种基于时间序列的在线识别方法,以利用资源使用时间序列在失效早期准确识别出杀手级任务并通知云计算系统采取前摄性失效恢复措施,从而避免不必要的重复调度和资源浪费。实验结果表明,该方法能够以98.5%的准确率在平均3%的失效时间内识别出杀手级任务,同时节约96.75%的系统资源。
By analyzing failure frequency and failure patterns in Google cluster dataset,this paper fond what are called as killer tasks that suffer from frequent and continuous failure.Killer task is a big concern of cloud system as it causes unnecessary resource wasting and significant increase of scheduling overhead.In this paper,an online recognition approach was proposed to make use of the resource usage time series to recognize killer tasks precisely at the very early stage of their occurrence so that proactive actions can be taken to avoid rescheduling and resource wasting.The experiment results show that the proposed approach performs a 98.5% precision in recognizing killer tasks at 3% of failure duration,with a 96.75% resource saving for the cloud system averagely.
出处
《计算机科学》
CSCD
北大核心
2017年第4期43-46,共4页
Computer Science
基金
深圳市科技计划重点项目(JSGG20140516162852628)资助