
Bayesian serial revision method for RLLC cluster systems failure prediction

Abstract: Failure prediction plays an important role in many tasks, such as optimal resource management in large-scale systems. However, accurately predicting the number of failures in repairable large-scale long-running computing (RLLC) systems is challenging because of their reparability and scale. To address this challenge, a general Bayesian serial revision prediction method based on the Bootstrap approach and the moving average approach is put forward, which can accurately predict the number of failures. To demonstrate the performance gains of the method, extensive experiments are carried out on data from the Los Alamos National Laboratory (LANL) cluster, a typical RLLC system. The experimental results show that the prediction accuracy of the method is 80.2%, an improvement of 4% over some typical methods. Finally, the managerial implications of the models are discussed.
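The abstract names the ingredients (a Bayesian update revised period by period, a Bootstrap step, and a moving average) but not how they are combined. As a rough illustration only, and not the authors' algorithm, the sketch below assumes a Gamma-Poisson conjugate model: the historical failure counts are smoothed with a trailing moving average, a Gamma prior on the failure rate is fitted by method of moments to bootstrap replicates of the smoothed mean, and the prior is then revised serially as each new period's count arrives. All function names, the window size, and the sample data in the usage note are hypothetical.

```python
import random
from statistics import mean, variance


def moving_average(counts, window):
    """Trailing moving average of a failure-count series."""
    return [mean(counts[i - window + 1:i + 1])
            for i in range(window - 1, len(counts))]


def bootstrap_gamma_prior(rates, n_resamples=2000, seed=0):
    """Fit Gamma(shape, rate) to bootstrap replicates of the mean failure rate.

    Assumption: method-of-moments fit, where mean = shape / rate and
    variance = shape / rate**2.
    """
    rng = random.Random(seed)
    n = len(rates)
    means = [mean(rng.choices(rates, k=n)) for _ in range(n_resamples)]
    m = mean(means)
    v = max(variance(means), 1e-9)  # guard against a degenerate (constant) series
    rate = m / v
    shape = m * rate
    return shape, rate


def serial_revision(history, new_counts, window=3):
    """Predict each new period's failure count, then revise the posterior.

    Assumption: failures per period are Poisson with a Gamma-distributed rate,
    so the update is shape += observed count, rate += 1 period.
    """
    smoothed = moving_average(history, window)
    shape, rate = bootstrap_gamma_prior(smoothed)
    predictions = []
    for observed in new_counts:
        predictions.append(shape / rate)  # posterior mean before seeing `observed`
        shape += observed
        rate += 1
    return predictions
```

For example, `serial_revision([12, 9, 14, 11, 10, 13, 12, 8, 11, 10], [11, 12, 10])` returns one prediction per new period, each made before that period's count is observed. Because the prior here is fitted to the spread of bootstrap means, it is fairly tight, so each revision moves the prediction only slightly; a looser prior would let new observations weigh more.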
Source: Journal of Systems Engineering and Electronics (系统工程与电子技术(英文版)), indexed in SCIE, EI, CSCD; 2011, No. 2, pp. 238-246 (9 pages)
Funding: Supported by the National Natural Science Foundation of China (60701006, 60804054, 71071158)
Keywords: failure prediction, cluster systems, Bayesian approach, failure rate

