Abstract
Co-locating workloads in a data center significantly raises the resource utilization of cloud data centers, but it also increases scheduling complexity and the job failure rate. Using cluster-trace-v2018, the data center trace dataset released by Alibaba Cloud, this paper analyzes in detail, from the perspective of offline batch workloads, how different workload types behave in terms of success rate and resource utilization. The main findings are as follows. 1) Failures in a small number of job types lower the overall job success rate of the cluster and waste cluster resources. 2) In the Fuxi distributed scheduling system, the execution time of task failover follows a Gaussian distribution, and the task scheduling delay follows a Zipf distribution. 3) The distribution of failed instances across cluster nodes shows that job failures are spatially random and that an instance that has failed is likely to fail again; temporally, however, the overall cluster failure rate is unevenly distributed. 4) Taking task instance failures as the baseline, the mean time between failures (MTBF) of the cluster nodes is calculated: most nodes have an MTBF of about 1000 s, while a small number of nodes with low instance failure rates have an MTBF above 10000 s.
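The abstract does not state how the per-node mean time between failures is computed. As a minimal sketch, assuming the common definition of MTBF as the length of the observation window divided by the number of instance failures seen on a node in that window, the calculation could look like the following. The function name `node_mtbf`, the record layout `(node_id, timestamp)`, and the sample values are hypothetical and are not taken from the paper or from the trace schema.

```python
from collections import defaultdict


def node_mtbf(failures, window_start, window_end):
    """Return {node_id: MTBF in seconds} for nodes with at least one failure.

    Assumption: MTBF = observation window length / number of failures.
    `failures` is an iterable of (node_id, failure_timestamp_in_seconds).
    """
    span = window_end - window_start
    if span <= 0:
        raise ValueError("window_end must be greater than window_start")

    # Count instance failures per node inside the observation window.
    counts = defaultdict(int)
    for node_id, ts in failures:
        if window_start <= ts <= window_end:
            counts[node_id] += 1

    # Nodes without failures are omitted: their MTBF is undefined here.
    return {node: span / n for node, n in counts.items()}


# Hypothetical failure records, for illustration only.
records = [("m_1", 100), ("m_1", 1100), ("m_1", 2050), ("m_2", 500)]
mtbf = node_mtbf(records, 0, 10000)
```

Under this definition, a node with many instance failures in the window gets a short MTBF and a node with few failures gets a long one, which matches the qualitative contrast the abstract draws between the two groups of nodes.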
Authors
蒋从锋
殷继亮
胡海周
闫龙川
张纪林
万健
仇烨亮
JIANG Cong-feng; YIN Ji-liang; HU Hai-zhou; YAN Long-chuan; ZHANG Ji-lin; WAN Jian; QIU Ye-liang (School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China; State Grid Electrical Information Communication Co., Ltd., Beijing 100053, China; School of Cyberspace Security, Hangzhou Dianzi University, Hangzhou 310018, China; School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China; Alibaba Cloud Computing Co., Ltd., Hangzhou 311121, China)
Source
《计算机科学》
CSCD
Peking University Core Journal
2021, Issue S02, pp. 225-231, 264 (8 pages in total)
Computer Science
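Funding
National Key Research and Development Program of China (2017YFB101000)
National Natural Science Foundation of China, General Program (61972118)
Key Research and Development Program of Zhejiang Province (2019C01059)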
Keywords
Co-located cluster
Workload characteristics
Distributed scheduling
Failure analysis