
Autonomic failure prediction based on manifold learning for large-scale distributed systems (Cited by: 2)

Abstract: This article investigates autonomic failure prediction in large-scale distributed systems, using nonlinear dimensionality reduction to extract failure features automatically. Most existing methods for failure prediction focus on building prediction models or heuristic rules by discovering failure patterns, but the feature extraction step that precedes failure pattern recognition is rarely considered, owing to the increasing complexity of modern distributed systems. In this work, a novel performance-centric approach to automating failure prediction is proposed based on manifold learning (ML). The ML algorithm named supervised locally linear embedding (SLLE) is applied to perform feature extraction. To generalize the dimensionality reduction mapping, a nonlinear mapping approximation and optimization solution is also proposed. In the experiments, a file transfer test bed with fault injection is developed that gathers multilevel performance metrics transparently. Based on runtime monitoring of these metrics, the SLLE method can automatically predict more than 50% of the central processing unit (CPU) and memory failures, and around 70% of the network failures.
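As a rough illustration of the feature-extraction step named in the abstract, the Python sketch below implements one commonly described form of supervised locally linear embedding: pairwise distances between samples of different classes are inflated before neighbor selection, and standard LLE is then run on those neighborhoods. This record does not give the article's formulation, so the function name `slle_embed`, the parameters `alpha`, `n_neighbors`, and `reg`, and the synthetic "normal" and "failure" metric vectors are all assumptions made for illustration, not the authors' method. The sketch also leaves out the mapping approximation that the abstract mentions for generalizing to new samples.

```python
import numpy as np


def slle_embed(X, y, n_neighbors=5, n_components=2, alpha=0.5, reg=1e-3):
    """Hypothetical SLLE sketch. X: (n_samples, n_features), y: class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]

    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Supervised step (assumed form): add alpha * max distance to every
    # pair whose labels differ, so neighbors tend to share a class.
    different = (y[:, None] != y[None, :]).astype(float)
    dist_sup = dist + alpha * dist.max() * different
    np.fill_diagonal(dist_sup, np.inf)  # a sample is never its own neighbor

    # Reconstruction weights: express each sample from its neighbors.
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist_sup[i])[:n_neighbors]
        Z = X[nbrs] - X[i]                       # neighbors centered on x_i
        G = Z @ Z.T                              # local Gram matrix
        G += (reg * np.trace(G) + 1e-9) * np.eye(n_neighbors)  # regularize
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, nbrs] = w / w.sum()                 # weights sum to one

    # Embedding: eigenvectors of M = (I - W)^T (I - W) with the smallest
    # eigenvalues, skipping the first one.
    I_minus_W = np.eye(n) - W
    M = I_minus_W.T @ I_minus_W
    _, eigvecs = np.linalg.eigh(M)               # eigenvalues in ascending order
    return eigvecs[:, 1:n_components + 1]


# Synthetic stand-in for monitored performance metrics (not real data).
rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(30, 6))
X_failure = rng.normal(2.0, 1.0, size=(30, 6))
X = np.vstack([X_normal, X_failure])
y = np.array([0] * 30 + [1] * 30)

Y = slle_embed(X, y)
print(Y.shape)  # (60, 2)
```

With `alpha=0` the sketch reduces to plain unsupervised LLE. With large `alpha` the neighborhoods can separate completely by class, in which case the first eigenvector is no longer guaranteed to be the constant one, so the value is best treated as a tuning parameter.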
Source: The Journal of China Universities of Posts and Telecommunications (中国邮电高校学报(英文版)), EI, CSCD, 2010, Issue 4, pp. 116-124 (9 pages)
Funding: This work was supported by the Hi-Tech Research and Development Program of China (2007AA01Z401) and the National Natural Science Foundation of China (90718003, 60973027).
Keywords: failure prediction, manifold learning, locally linear embedding, autonomic computing

