Abstract
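Memory reliability is one of the key challenges in building exascale computing systems. Memory faults account for more than 40% of hardware faults in computer systems. As the number of memory devices grows, memory density scales, and interface rates rise, reliability problems in the memory and in the memory-access transmission paths of exascale computers will become increasingly severe, and the error detection and correction capability of the traditional SEC-DED Hamming code cannot meet the high-reliability requirements of exascale systems. RS codes are polynomial codes with strong error-correcting capability and can implement Chipkill; however, the decoding circuits of RS codes that correct multi-symbol errors are complex, which makes them difficult to apply directly to memory. This paper proposes R-RS (Retransmission-RS), a memory reliability enhancement technique based on RS codes and a retransmission mechanism. By carefully selecting the primitive polynomial and parity-check matrix, it designs an RS code with low hardware implementation overhead, achieves parallel, efficient, low-latency decoding through fine-grained circuit design, and proposes a window-based order-preserving retransmission mechanism that retransmits data affected by errors from occasional faults on the transmission link. R-RS can correct four 8-bit symbol errors and effectively handles random single-bit errors and burst errors in transmission links and inside memory, as well as occasional transmission-link errors. R-RS has a redundant storage overhead of 12.5%, its performance overhead is one additional cycle of decoding latency, and its area accounts for only 3.5% of the whole memory controller. Compared with E-ECC, a scheme of the same class, its ability to correct two-device and three-device burst errors improves by 83.3% and 109.5%, respectively, while its miscorrection probability falls by 97.8%. Simulation with parameters from a real memory error model shows that the average error-correcting capability of R-RS is 31% higher than that of E-ECC, and the retransmission function of R-RS reduces the memory-access failure rate by 42.1% in a real system. After R-RS was deployed in the new-generation Sunway exascale computer system, the system's mean time between failures increased by 35.3 times, showing that R-RS is an effective memory reliability solution for exascale computing.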
Memory reliability is one of the key challenges in building exascale supercomputers. Memory faults account for more than 40% of the hardware faults in computer systems. With aggressive storage density scaling, lower supply voltages, higher memory interface rates, and a growing number of memory devices, the reliability problems of memory devices and transmission paths become increasingly serious in exascale computing systems. The traditional SEC-DED (Single Error Correct/Double Error Detect) Hamming code can detect and correct random single-bit errors but cannot correct burst errors, which occur commonly in supercomputer systems, so it cannot meet the high-reliability requirements of exascale supercomputers. Reed-Solomon (RS) codes are a class of symbol-based polynomial codes with a strong ability to correct random errors and burst errors simultaneously, and they are widely used in hard disk protection and communication systems. However, the complicated decoding circuits of RS codes that correct multi-symbol errors have prevented them from being applied to memory-related architectures. In this paper, R-RS (Retransmission Reed-Solomon), a memory reliability enhancement technology based on an RS code and a retransmission mechanism, is proposed. By carefully selecting the primitive polynomial and parity-check matrix, an RS code with the lowest hardware implementation overhead is designed, and efficient, low-latency parallel decoding is realized through refined circuit design. A retransmission mechanism based on window order-preserving is proposed to automatically retransmit data corrupted by occasional faults in the transmission link. R-RS is able to correct up to four 8-bit symbol errors and can effectively deal with random single-bit errors and burst errors occurring in transmission links or memory devices. We use Design Compiler to synthesize the proposed design in 28 nm technology. R-RS has a 12.5% storage overhead, and its performance overhead is a 2-cycle latency for RS decoding. The area of R-RS accounts for only 3.5% of the entire memory controller block. Compared with E-ECC, a recent RS-code-based memory protection counterpart, the correctable probability for 2-device and 3-device burst errors is improved by 83.3% and 109.5%, respectively, while the miscorrection rate is reduced by 97.8%. Using an in-field memory fault model, we built a simulation platform in C and simulated the correcting ability of SEC-DED, E-ECC, and R-RS. The results show that the average error-correcting ability of R-RS is 31% higher than that of E-ECC and an order of magnitude higher than that of SEC-DED. The retransmission function of R-RS reduces the system error rate caused by memory system failures by 42.1% in a real supercomputer system. R-RS was successfully applied in the new generation of the Sunway exascale supercomputer system and increased the mean time between errors of the whole system by 35.3 times, indicating that R-RS is an effective memory reliability solution for exascale supercomputer systems.
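As a rough illustration of the code parameters quoted above: correcting t = 4 symbol errors with an RS code requires 2t = 8 check symbols, and a 12.5% storage overhead then implies 64 data symbols, i.e. a shortened RS(72, 64) code over GF(2^8). That (72, 64) layout is inferred from the abstract, not stated in it. The Python sketch below encodes one 64-byte block and computes its syndromes. The primitive polynomial 0x11D, the generator roots, and all function names are assumptions chosen for illustration, not the polynomial or check matrix selected by the authors, and error location and correction are not shown.

```python
# Hypothetical sketch of a shortened RS(72, 64) code over GF(2^8).
# PRIM_POLY and the generator roots alpha^0..alpha^7 are illustrative
# assumptions; the paper chooses its own polynomial and check matrix.

PRIM_POLY = 0x11D          # x^8 + x^4 + x^3 + x^2 + 1 (assumed)
N_DATA, N_CHECK = 64, 8    # 8 / 64 = 12.5% overhead, corrects up to 4 symbols

# Exponent/logarithm tables for GF(2^8) with alpha = 2.
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM_POLY
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two field elements using the log/exp tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def generator_poly(n_check):
    """g(x) = (x + alpha^0)(x + alpha^1)...(x + alpha^(n_check-1)), highest degree first."""
    g = [1]
    for i in range(n_check):
        root = EXP[i]
        nxt = g + [0]                      # g(x) * x
        for j, c in enumerate(g):
            nxt[j + 1] ^= gf_mul(c, root)  # + g(x) * root
        g = nxt
    return g


def encode(data):
    """Systematic encoding: append the remainder of data(x) * x^N_CHECK mod g(x)."""
    g = generator_poly(N_CHECK)
    rem = list(data) + [0] * N_CHECK
    for i in range(len(data)):
        coef = rem[i]
        if coef:
            for j in range(1, len(g)):
                rem[i + j] ^= gf_mul(g[j], coef)
    return list(data) + rem[len(data):]


def syndromes(word):
    """Evaluate the received word at alpha^0..alpha^(N_CHECK-1) by Horner's rule."""
    out = []
    for i in range(N_CHECK):
        s = 0
        for byte in word:
            s = gf_mul(s, EXP[i]) ^ byte
        out.append(s)
    return out


block = list(range(N_DATA))        # one 64-byte block of example data
codeword = encode(block)           # 72 bytes: 64 data + 8 check
corrupted = list(codeword)
for pos in (0, 17, 40, 71):        # four corrupted 8-bit symbols
    corrupted[pos] ^= 0x5A
print(any(syndromes(codeword)), any(syndromes(corrupted)))
```

All-zero syndromes mean no detectable error. Because an RS code with 8 check symbols has minimum distance 9, any pattern of one to eight corrupted symbols yields at least one nonzero syndrome, which a full decoder would then use to locate and correct up to four symbols.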
Authors
高剑刚
石嵩
郑方
GAO Jian-Gang; SHI Song; ZHENG Fang (National Research Center of Parallel Computer Engineering and Technology, Beijing 100190)
Source
《计算机学报》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2023, Issue 2, pp. 260-273 (14 pages)
Chinese Journal of Computers