A Brief Analysis of Data Collection Methods for Failure Prediction in Exascale Supercomputers

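Abstract: As society develops, China's level of modern technology continues to rise, yet supercomputers still face many challenges in their development. Safety and reliability have become one of the issues limiting the performance of supercomputer systems as a whole. An exascale (E-class) supercomputer consists of hundreds of thousands of components, and with so many parts, failures arise during operation for a variety of reasons. If a failure is not resolved in time, the entire system is forced to stop and restart. Addressing this problem at its root requires predicting failures before they occur, which is the only way to keep the supercomputer operating safely and to improve its stability. On this basis, this paper presents a brief study of data collection methods for failure prediction in exascale supercomputers.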
Author: LU Peng (陆鹏)
Source: Information Technology and Informatization (《信息技术与信息化》), 2017, No. 1, pp. 60-62 (3 pages)
