2 articles found
Storage wall for exascale supercomputing (cited by: 2)
1
Authors: Wei HU, Guang-ming LIU, Qiong LI, Yan-huang JIANG, Gui-lin CAI. Frontiers of Information Technology & Electronic Engineering (SCIE, EI, CSCD), 2016, Issue 11, pp. 1154-1175 (22 pages)
The mismatch between compute performance and I/O performance has long been a stumbling block as supercomputers evolve from petaflops to exaflops. Currently, many parallel applications are I/O intensive, and their overall running times are typically limited by I/O performance. To quantify the I/O performance bottleneck and highlight the significance of achieving scalable performance in peta/exascale supercomputing, in this paper, we introduce for the first time a formal definition of the 'storage wall' from the perspective of parallel application scalability. We quantify the effects of the storage bottleneck by providing a storage-bounded speedup, defining the storage wall quantitatively, presenting existence theorems for the storage wall, and classifying the system architectures depending on I/O performance variation. We analyze and extrapolate the existence of the storage wall by experiments on Tianhe-1A and case studies on Jaguar. These results provide insights on how to alleviate the storage wall bottleneck in system design and achieve hardware/software optimizations in peta/exascale supercomputing.
Keywords: storage-bounded speedup; storage wall; high-performance computing; exascale computing. Classification: TP338.6
FTRP: a fault tolerance framework for high-performance computing based on process replication and prefetching (in English)
2
Authors: Wei HU, Guang-ming LIU, Yan-huang JIANG. Frontiers of Information Technology & Electronic Engineering (SCIE, EI, CSCD), 2018, Issue 10, pp. 1273-1290 (18 pages)
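As supercomputers rapidly grow in scale, reliability has become the main problem limiting system availability. Existing fault tolerance mechanisms, including checkpointing and process redundancy, cannot solve this problem effectively. We therefore propose FTRP (fault tolerance framework using process replication and prefetching), a fault tolerance framework for high-performance computing based on process replication and prefetching. The framework combines the advantages of proactive and reactive fault tolerance mechanisms and introduces a novel cost model and a proactive fault tolerance mechanism, which together improve application execution efficiency. We propose a novel "work-most" (WM) cost model that, based on failure prediction results and application state, adaptively selects the runtime fault tolerance decision online from a set of fault tolerance mechanisms. Analogous to locality in program execution, we observe for the first time a failure locality phenomenon in supercomputers. Based on failure locality, we propose a new fault tolerance mechanism that combines process replication with process prefetching and effectively avoids the losses caused by failures, whether or not they can be predicted. Simulation experiments based on real failure traces and typical failure prediction accuracy show that applications using FTRP achieve a 10% improvement over existing fault tolerance mechanisms, and that the framework remains effective on petascale and even larger systems.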
Keywords: high-performance computing; proactive fault tolerance; failure locality; process replication; process prefetching