
Distributed fault localization framework for high performance computing (cited by: 4)
Abstract: To address the difficulty and poor real-time performance of fault localization in high performance computing systems, a Message-Passing based Fault Localization (MPFL) framework is proposed, comprising Tree-based Fault Detection (TFD) and Tree-based Fault Analysis (TFA) algorithms. First, when a parallel job is initialized, all participating compute nodes are logically partitioned into a tree, yielding a Fault Localization Tree (FLT), and the fault localization tasks are distributed across the nodes. Second, when a component such as the message-passing library or the operating system detects an abnormal node state, the TFD algorithm analyzes the job's FLT structure and selects the node that will receive the abnormal state, according to factors such as load balancing and performance overhead. Finally, the receiving node applies the TFA algorithm to reason over the abnormal states it has received and derive the fault. TFA combines rule-based event correlation with lightweight active probing built on message passing, which improves the accuracy of fault analysis. MPFL was evaluated on a cluster, with simulated node-shutdown faults as the localization target and NPB-FT and NPB-IS as benchmarks. The experimental results show that MPFL performs well in both fault localization capability and overhead reduction.
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal (北大核心), 2018, No. 1, pp. 44-49 (6 pages)
Funding: National Key Research and Development Program of China (2016YFB0200502)
Keywords: high performance computing; message-passing; fault localization; event correlation; active probing
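
The three steps summarized in the abstract (building the FLT, selecting a receiver for an abnormal state, and combining event correlation with active probing) can be illustrated with a minimal Python sketch. The abstract does not specify the actual algorithms, so everything here is an assumption made for illustration: the k-ary tree layout, the candidate set of parent plus siblings, the least-loaded selection rule, the report threshold, and the names `build_flt`, `select_receiver`, `analyze` and `probe` are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_flt(num_nodes, fanout=2):
    """Map each node rank to its parent in a logical k-ary tree rooted at rank 0.

    Assumption: the FLT is a complete k-ary tree over ranks 0..num_nodes-1.
    """
    return {r: (None if r == 0 else (r - 1) // fanout) for r in range(num_nodes)}


def children_of(parent, rank):
    """Return the ranks whose parent in the tree is `rank`."""
    return [r for r, p in parent.items() if p == rank]


def select_receiver(parent, suspect, load):
    """Pick the node that should receive an abnormal-state report about `suspect`.

    Assumption: candidates are the suspect's parent and siblings (or its
    children when the suspect is the root); the least-loaded candidate wins,
    with the lower rank breaking ties.
    """
    p = parent[suspect]
    if p is None:
        candidates = children_of(parent, suspect)
    else:
        candidates = [p] + [c for c in children_of(parent, p) if c != suspect]
    return min(candidates, key=lambda r: (load.get(r, 0), r))


def analyze(reports, probe, threshold=2):
    """Correlate reports with a simple rule, then confirm by active probing.

    reports: iterable of (reporter_rank, suspect_rank) pairs.
    probe:   hypothetical callable returning True if the node answers.
    Rule (assumed): a node is probed once `threshold` distinct reporters
    have flagged it; it is declared faulty only if the probe fails.
    """
    reporters = defaultdict(set)
    for reporter, suspect in reports:
        reporters[suspect].add(reporter)
    faulty = []
    for suspect in sorted(reporters):
        if len(reporters[suspect]) >= threshold and not probe(suspect):
            faulty.append(suspect)
    return faulty


if __name__ == "__main__":
    flt = build_flt(7)
    receiver = select_receiver(flt, suspect=3, load={1: 5, 4: 1})
    faults = analyze([(1, 3), (4, 3), (2, 5)], probe=lambda r: r != 3)
    print(receiver, faults)
```

In this sketch the probe is only issued for nodes that the correlation rule has already flagged, which keeps probing traffic low; a real deployment would replace `probe` with a message-passing round trip bounded by a timeout.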