Funding: the National Natural Science Foundation of China (Nos. 61272141 and 61120106005) and the National High-Tech R&D Program (863) of China (No. 2012AA01A301)
Abstract: The mismatch between compute performance and I/O performance has long been a stumbling block as supercomputers evolve from petaflops to exaflops. Many parallel applications are now I/O intensive, and their overall running times are typically limited by I/O performance. To quantify the I/O performance bottleneck and highlight the significance of achieving scalable performance in peta/exascale supercomputing, we introduce for the first time a formal definition of the 'storage wall' from the perspective of parallel application scalability. We quantify the effects of the storage bottleneck by providing a storage-bounded speedup, defining the storage wall quantitatively, presenting existence theorems for the storage wall, and classifying system architectures according to I/O performance variation. We analyze and extrapolate the existence of the storage wall through experiments on Tianhe-1A and case studies on Jaguar. These results provide insights into how to alleviate the storage wall bottleneck in system design and achieve hardware/software optimizations in peta/exascale supercomputing.
Funding: Project supported by the National Natural Science Foundation of China (Nos. 61272141, 61120106005, and 61303068) and the National High-Tech R&D Program of China (No. 2012AA01A301)
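Abstract: As supercomputers grow rapidly in scale, reliability has become the main factor limiting system availability. Existing fault tolerance mechanisms, including checkpointing and process redundancy, cannot effectively solve this problem. We therefore propose FTRP (fault tolerance framework using process replication and prefetching), a fault tolerance framework for high-performance computing based on process replication and prefetching. The framework combines the advantages of proactive and reactive fault tolerance, and introduces a novel cost model and a proactive fault tolerance mechanism that together improve application running efficiency. The 'work-most' (WM) cost model uses failure prediction results and application state to make runtime fault tolerance decisions online and adaptively, choosing from a set of fault tolerance mechanisms. Analogous to locality in program execution, we observe for the first time a failure locality phenomenon in supercomputers. Based on failure locality, we propose a new fault tolerance mechanism that combines process replication with process prefetching and effectively avoids the losses caused by failures, whether or not they can be predicted. Simulation experiments based on real failure traces and typical failure prediction accuracy show that applications using the FTRP framework achieve a 10% improvement over existing fault tolerance mechanisms, and that the framework remains effective on petascale and larger systems.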