
A Rapid Fault Localization Method for Large-scale Parallel Jobs (cited by: 1)
Abstract: Using fault information obtained from job running states, the events that may cause a job to fail are classified and graded by severity. On that basis, a rapid fault analysis and localization method for batches of large-scale parallel jobs is built from the scale of the problem and its correlations. The method examines candidate causes from the top down, layer by layer, with automatic judgement analysis, which narrows the range of faults to handle and addresses the difficulty and poor accuracy of fault localization for large-scale jobs. An evaluation of the method shows that it can help operators detect faults.
Authors: ZHU Guanghui (朱光慧); ZENG Yunhui (曾云辉) (Qilu University of Technology (Shandong Academy of Sciences), Jinan 250101, China; Shandong Provincial Computer Science Center (National Supercomputer Center in Jinan), Jinan 250101, China; Shandong Provincial Key Laboratory of Computer Networks, Jinan 250101, China)
Source: Journal of Zhengzhou University (Natural Science Edition) (《郑州大学学报(理学版)》), indexed in CAS and the Peking University Core Journals list, 2019, No. 4, pp. 102-109 (8 pages)
Funding: National Key Research and Development Program of China (2016YFB0201100)
Keywords: fault localization; parallel job; high performance computing; large-scale; correlation
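
The abstract describes grading cause events by severity and then narrowing the fault range from the top down, layer by layer. The record does not give the paper's actual algorithm, so what follows is only a minimal sketch of that general idea under stated assumptions. The severity names, the `FaultEvent` fields, the component hierarchy in `layer_path`, and the `share` threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical severity grades; the abstract says events are graded
# by severity but does not list the grades themselves.
SEVERITY = {"fatal": 3, "serious": 2, "warning": 1}


@dataclass(frozen=True)
class FaultEvent:
    job_id: str
    layer_path: tuple  # hypothetical hierarchy, coarsest first, e.g. ("cab1", "node1")
    category: str      # hypothetical category label, e.g. "network"
    severity: str      # one of the SEVERITY keys


def localize(events, failed_jobs, share=0.5):
    """Descend the component hierarchy while a single component is
    implicated in at least `share` of the failed jobs.

    Returns (path, worst): the component path reached, and the most
    severe event among those still in scope (None if there are none).
    """
    failed = set(failed_jobs)
    if not failed:
        return (), None
    candidates = [e for e in events if e.job_id in failed]
    path = ()
    depth = 0
    while True:
        # Group failed jobs by the component they touch at this depth.
        jobs_by_component = {}
        for e in candidates:
            if len(e.layer_path) > depth:
                jobs_by_component.setdefault(e.layer_path[depth], set()).add(e.job_id)
        if not jobs_by_component:
            break  # no finer layer to inspect
        # Pick the component shared by the most failed jobs (name breaks ties).
        component, jobs = max(
            jobs_by_component.items(), key=lambda kv: (len(kv[1]), kv[0])
        )
        if len(jobs) / len(failed) < share:
            break  # failures are too scattered to narrow further
        path += (component,)
        candidates = [
            e for e in candidates
            if len(e.layer_path) > depth and e.layer_path[depth] == component
        ]
        depth += 1
    worst = max(candidates, key=lambda e: SEVERITY[e.severity], default=None)
    return path, worst
```

Calling `localize(events, failed_jobs)` on a batch of failed jobs returns the deepest component path that a large enough share of them have in common, plus the most severe event within that scope, which an operator could inspect first. Measuring the share against all failed jobs in the batch, rather than against the jobs remaining at each layer, is a design choice made for this sketch only.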