Abstract
Based on a large volume of memory fault statistics from the Sunway TaihuLight and Sunway BlueLight supercomputers, a memory failure time model for petascale supercomputers is built. Sequential rule mining is used to analyze the sequential patterns of memory failures, yielding the correlation between memory failure sequences on CPU nodes and subsequent memory failures. The memory fault and memory failure characteristics of parallel applications are studied with a co-analysis method. The results show that computing-memory-I/O intensive applications have a large impact on memory faults, whereas application type has only a limited impact on memory failures, which may instead be related to the reliability of the memory chips themselves.
Authors
刘睿涛
陈左宁
LIU Ruitao; CHEN Zuoning (State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214215, China; National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China)
Source
《计算机工程》
CAS
CSCD
PKU Core Journals (北大核心)
2019, No. 5, pp. 35-45 (11 pages)
Computer Engineering
Funding
National Key Research and Development Program of China (2016YFB0200502)
Keywords
supercomputer
memory fault
memory failure
statistical data
failure model
correlation relationship
co-analysis