Autonomic failure prediction based on manifold learning for large-scale distributed systems (Cited by: 2)



Abstract: This article investigates autonomic failure prediction in large-scale distributed systems, using nonlinear dimensionality reduction to extract failure features automatically. Most existing failure prediction methods focus on building prediction models or heuristic rules by discovering failure patterns, but the feature extraction step that precedes failure pattern recognition is rarely considered, owing to the increasing complexity of modern distributed systems. In this work, a novel performance-centric approach to automating failure prediction is proposed based on manifold learning (ML). The ML algorithm known as supervised locally linear embedding (SLLE) is applied to perform feature extraction. To generalize the dimensionality reduction mapping, a nonlinear mapping approximation and an optimization solution are also proposed. For the experiments, a file transfer test bed with fault injection was developed that transparently gathers multilevel performance metrics. Based on runtime monitoring of these metrics, the SLLE method can automatically predict more than 50% of the central processing unit (CPU) and memory failures, and around 70% of the network failures.
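The abstract does not spell out how SLLE uses failure labels, so the following is only a minimal sketch of the supervised neighbor-selection step, assuming the commonly used formulation in which distances between samples of different classes are inflated by `alpha * max(D)` before the nearest neighbors are chosen. The metric names, sample values, labels, and the helper functions `supervised_distances` and `nearest_neighbors` are hypothetical and are not taken from the article; the later LLE steps (reconstruction weights and the embedding itself) are omitted.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length metric vectors."""
    return [[math.dist(p, q) for q in points] for p in points]


def supervised_distances(points, labels, alpha):
    """Inflate distances between samples that carry different labels.

    Assumed formulation: D'[i][j] = D[i][j] + alpha * max(D) if the labels
    differ, else D[i][j]. alpha = 0 reduces to plain LLE distances and
    alpha = 1 is the fully supervised case.
    """
    d = pairwise_distances(points)
    d_max = max(max(row) for row in d)
    n = len(points)
    return [
        [d[i][j] + (alpha * d_max if labels[i] != labels[j] else 0.0)
         for j in range(n)]
        for i in range(n)
    ]


def nearest_neighbors(dist, k):
    """Indices of the k nearest neighbors of each sample, excluding itself."""
    neighbors = []
    for i, row in enumerate(dist):
        others = [j for j in range(len(row)) if j != i]
        neighbors.append(sorted(others, key=lambda j: row[j])[:k])
    return neighbors


# Hypothetical samples: (cpu_utilization, memory_utilization) per time window.
samples = [(0.20, 0.30), (0.25, 0.35), (0.30, 0.32), (0.34, 0.36), (0.90, 0.85)]
labels = ["normal", "normal", "failure", "failure", "failure"]

unsupervised = nearest_neighbors(supervised_distances(samples, labels, 0.0), k=2)
supervised = nearest_neighbors(supervised_distances(samples, labels, 1.0), k=2)

print("neighbors of sample 2, alpha=0:", unsupervised[2])
print("neighbors of sample 2, alpha=1:", supervised[2])
```

With `alpha = 0`, sample 2 (a failure sample) picks a nearby normal sample as a neighbor; with `alpha = 1`, its neighbors come only from the failure class, which is what lets the embedding separate classes more cleanly.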

Authors: LU Xu, WANG Hui-qiang, ZHOU Ren-jie, GE Bao-yu

Affiliation: College of Computer Science and Technology

Source: The Journal of China Universities of Posts and Telecommunications (中国邮电高校学报（英文版）), indexed in EI and CSCD, 2010, Issue 4, pp. 116-124 (9 pages)

Funding: This work was supported by the Hi-Tech Research and Development Program of China (2007AA01Z401) and the National Natural Science Foundation of China (90718003, 60973027).

Keywords: failure prediction, manifold learning, locally linear embedding, autonomic computing

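Classification number: TP393 [Automation and Computer Technology: Computer Application Technology]; TH17 [Mechanical Engineering: Mechanical Manufacturing and Automation]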



