Abstract
A high-throughput computing system consists of a massive number of computing nodes and storage nodes interconnected by a network. At this scale, reliability becomes a very serious problem: component failure is the norm, and system design must account for fault tolerance. We need to build a new reliability assurance framework for high-throughput computing systems that meets the reliability requirements of the different layers of high-throughput computing, and to study cross-layer reliable computing techniques spanning the chip level to the system level. Toward this goal, the study addresses three areas: (1) fault detection and fault-tolerant design methods for high-throughput processing chips; (2) failure detection and recovery methods for high-throughput computing systems; and (3) a supporting environment for fault self-prediction, self-detection, self-localization, self-isolation, and self-healing (5S) from the chip level to the system level. As of 2013, all work was progressing steadily according to the original plan in the task specification, and some of it had produced interim results. Breakthroughs were achieved on (1) online prediction of NBTI aging faults; (2) system fault prediction techniques such as deep learning; (3) register fault diagnosis; and (4) on-chip network communication isolation. Six papers were published or accepted in IEEE Transactions journals, plus one in another journal. In terms of coverage, the research points deployed cover every research plan specified in the task specification, and some research points have been refined further.
Source
Science and Technology Innovation Herald (《科技创新导报》)
2016, Issue 9, p. 169 (1 page)
Keywords
Reliability design
Fault detection
Deep learning
Online prediction
Communication isolation