Abstract
Because of their high parallelism and simple communication, 2D computing arrays in deep learning accelerators (DLAs) typically handle the bulk of convolution computation. A hardware fault in the array produces computation errors that can sharply degrade prediction accuracy. To repair faults in 2D computing arrays, this paper proposes a recomputing architecture (RCA) for fault-tolerant DLAs. Unlike the traditional real-time fault-repair strategy of adding redundancy inside the array, the RCA provides a set of redundancy-based recomputing units (RCUs) that redo the work of faulty units one-to-one in later cycles. Experimental results show that, compared with previous fault-tolerant schemes, the proposed method achieves higher fault-repair capability and better scalability while occupying less chip area.
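The one-to-one recomputation idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the matrix-multiply model of the PE array, and the fault model (a faulty PE yields a corrupted partial sum) are all assumptions made purely for illustration:

```python
import numpy as np

def array_compute(weights, acts, faulty):
    """Model a 2D PE array: entry (r, c) is the dot product computed by PE (r, c).
    PEs listed in `faulty` produce corrupted results (modeled here as 0)."""
    out = weights @ acts
    for (r, c) in faulty:
        out[r, c] = 0.0  # corrupted output of a faulty PE
    return out

def rcu_repair(out, weights, acts, faulty):
    """Redundancy-based RCUs redo each faulty PE's computation one-to-one
    in later cycles, overwriting the corrupted entries."""
    for (r, c) in faulty:
        out[r, c] = weights[r] @ acts[:, c]  # recompute the single dot product
    return out
```

In this toy model the repair cost scales with the number of faulty PEs rather than with the array size, which mirrors the deferred, per-unit repair strategy the abstract contrasts with in-array redundancy.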
Authors
WANG Qianlong, XU Dawen (School of Electronic Science and Applied Physics, Hefei University of Technology, Hefei 230601, China)
Source
Journal of Hefei University of Technology (Natural Science)
Indexed in: CAS; Peking University Core Journals (北大核心)
2023, No. 1, pp. 54-59 (6 pages)
Funding
Supported by the National Natural Science Foundation of China (Grant No. 61834006).