A Brief Analysis of Data Collection Methods for Failure Prediction in Exascale Supercomputers


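Abstract: As China's technological capabilities have advanced, supercomputers still face many challenges in their development, and safety and reliability have become one of the issues limiting the performance of the overall system. An exascale (E级) supercomputer consists of hundreds of thousands of components. With so many components, failures arise from a variety of causes during operation, and a failure that is not resolved promptly forces the whole system to stop and restart. Addressing this problem at its root requires predicting failures, which is the only way to keep the supercomputer safe to use and to improve its stability. On this basis, this paper briefly examines data collection methods for failure prediction in exascale supercomputers.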

Author: LU Peng (陆鹏)

Affiliation: College of Information Science and Technology, Bohai University (渤海大学信息科学与技术学院)

Source: Information Technology and Informatization (《信息技术与信息化》), 2017, No. 1, pp. 60-62 (3 pages)

Keywords: exascale supercomputer; failure prediction; data collection

Classification: TP338 [Automation and Computer Technology: Computer System Architecture]



