This paper introduces the development of the exascale (10^18) computing. Though exascalc computing is a hot research direction worldwide, we are facing many challenges in the areas of memory wall, communica- tion wa...This paper introduces the development of the exascale (10^18) computing. Though exascalc computing is a hot research direction worldwide, we are facing many challenges in the areas of memory wall, communica- tion wall, reliability wall, power wall and scalability of parallel computing. According to these challenges, some thoughts and strategies are proposed.展开更多
Facing the challenges of the next generation exascale computing,National University of Defense Technology has developed a prototype system to explore opportunities,solutions,and limits toward the next generation Tianh...Facing the challenges of the next generation exascale computing,National University of Defense Technology has developed a prototype system to explore opportunities,solutions,and limits toward the next generation Tianhe system.This paper briefly introduces the prototype system,which is deployed at the National Supercomputer Center in Tianjin and has a theoretical peak performance of 3.15 Pflops.A total of 512 compute nodes are found where each node has three proprietary CPUs called Matrix-2000+.The system memory is 98.3 TB,and the storage is 1.4 PB in total.展开更多
The mismatch between compute performance and I/O performance has long been a stumbling block as supercomputers evolve from petaflops to exaflops. Currently, many parallel applications are I/O intensive,and their overa...The mismatch between compute performance and I/O performance has long been a stumbling block as supercomputers evolve from petaflops to exaflops. Currently, many parallel applications are I/O intensive,and their overall running times are typically limited by I/O performance. To quantify the I/O performance bottleneck and highlight the significance of achieving scalable performance in peta/exascale supercomputing, in this paper, we introduce for the first time a formal definition of the ‘storage wall' from the perspective of parallel application scalability. We quantify the effects of the storage bottleneck by providing a storage-bounded speedup,defining the storage wall quantitatively, presenting existence theorems for the storage wall, and classifying the system architectures depending on I/O performance variation. We analyze and extrapolate the existence of the storage wall by experiments on Tianhe-1A and case studies on Jaguar. These results provide insights on how to alleviate the storage wall bottleneck in system design and achieve hardware/software optimizations in peta/exascale supercomputing.展开更多
Recently,researchers have shown increasing interest in combining more than one programming model into systems running on high performance computing systems(HPCs)to achieve exascale by applying parallelism at multiple ...Recently,researchers have shown increasing interest in combining more than one programming model into systems running on high performance computing systems(HPCs)to achieve exascale by applying parallelism at multiple levels.Combining different programming paradigms,such as Message Passing Interface(MPI),Open Multiple Processing(OpenMP),and Open Accelerators(OpenACC),can increase computation speed and improve performance.During the integration of multiple models,the probability of runtime errors increases,making their detection difficult,especially in the absence of testing techniques that can detect these errors.Numerous studies have been conducted to identify these errors,but no technique exists for detecting errors in three-level programming models.Despite the increasing research that integrates the three programming models,MPI,OpenMP,and OpenACC,a testing technology to detect runtime errors,such as deadlocks and race conditions,which can arise from this integration has not been developed.Therefore,this paper begins with a definition and explanation of runtime errors that result fromintegrating the three programming models that compilers cannot detect.For the first time,this paper presents a classification of operational errors that can result from the integration of the three models.This paper also proposes a parallel hybrid testing technique for detecting runtime errors in systems built in the C++programming language that uses the triple programming models MPI,OpenMP,and OpenACC.This hybrid technology combines static technology and dynamic technology,given that some errors can be detected using static techniques,whereas others can be detected using dynamic technology.The hybrid technique can detect more errors because it combines two distinct technologies.The proposed static technology detects a wide range of error types in less time,whereas a portion of the potential errors that may or may not occur depending on the 4502 CMC,2023,vol.74,no.2 operating environment are left to the dynamic technology,which completes the validation.展开更多
With the coming of exascale supercomputing era, power efficiency has become the most important obstacle to build an exascale system. Dataflow architecture has native advantage in achieving high power efficiency for sc...With the coming of exascale supercomputing era, power efficiency has become the most important obstacle to build an exascale system. Dataflow architecture has native advantage in achieving high power efficiency for scientific applications. However, the state-of-the-art dataflow architectures fail to exploit high parallelism for loop processing. To address this issue, we propose a pipelining loop optimization method (PLO), which makes iterations in loops flow in the processing element (PE) array of dataflow accelerator. This method consists of two techniques, architecture-assisted hardware iteration and instruction-assisted software iteration. In hardware iteration execution model, an on-chip loop controller is designed to generate loop indexes, reducing the complexity of computing kernel and laying a good f(mndation for pipelining execution. In software iteration execution model, additional loop instructions are presented to solve the iteration dependency problem. Via these two techniques, the average number of instructions ready to execute per cycle is increased to keep floating-point unit busy. Simulation results show that our proposed method outperforms static and dynamic loop execution model in floating-point efficiency by 2.45x and 1.1x on average, respectively, while the hardware cost of these two techniques is acceptable.展开更多
文摘This paper introduces the development of the exascale (10^18) computing. Though exascalc computing is a hot research direction worldwide, we are facing many challenges in the areas of memory wall, communica- tion wall, reliability wall, power wall and scalability of parallel computing. According to these challenges, some thoughts and strategies are proposed.
基金supported by the National Key Research and Development Program of China(No.2016YFB0200401)。
文摘Facing the challenges of the next generation exascale computing,National University of Defense Technology has developed a prototype system to explore opportunities,solutions,and limits toward the next generation Tianhe system.This paper briefly introduces the prototype system,which is deployed at the National Supercomputer Center in Tianjin and has a theoretical peak performance of 3.15 Pflops.A total of 512 compute nodes are found where each node has three proprietary CPUs called Matrix-2000+.The system memory is 98.3 TB,and the storage is 1.4 PB in total.
基金the National Natural Science Foundation of China(Nos.61272141 and 61120106005)the National High-Tech R&D Program(863)of China(No.2012AA01A301)
文摘The mismatch between compute performance and I/O performance has long been a stumbling block as supercomputers evolve from petaflops to exaflops. Currently, many parallel applications are I/O intensive,and their overall running times are typically limited by I/O performance. To quantify the I/O performance bottleneck and highlight the significance of achieving scalable performance in peta/exascale supercomputing, in this paper, we introduce for the first time a formal definition of the ‘storage wall' from the perspective of parallel application scalability. We quantify the effects of the storage bottleneck by providing a storage-bounded speedup,defining the storage wall quantitatively, presenting existence theorems for the storage wall, and classifying the system architectures depending on I/O performance variation. We analyze and extrapolate the existence of the storage wall by experiments on Tianhe-1A and case studies on Jaguar. These results provide insights on how to alleviate the storage wall bottleneck in system design and achieve hardware/software optimizations in peta/exascale supercomputing.
基金[King Abdulaziz University][Deanship of Scientific Research]Grant Number[KEP-PHD-20-611-42].
文摘Recently,researchers have shown increasing interest in combining more than one programming model into systems running on high performance computing systems(HPCs)to achieve exascale by applying parallelism at multiple levels.Combining different programming paradigms,such as Message Passing Interface(MPI),Open Multiple Processing(OpenMP),and Open Accelerators(OpenACC),can increase computation speed and improve performance.During the integration of multiple models,the probability of runtime errors increases,making their detection difficult,especially in the absence of testing techniques that can detect these errors.Numerous studies have been conducted to identify these errors,but no technique exists for detecting errors in three-level programming models.Despite the increasing research that integrates the three programming models,MPI,OpenMP,and OpenACC,a testing technology to detect runtime errors,such as deadlocks and race conditions,which can arise from this integration has not been developed.Therefore,this paper begins with a definition and explanation of runtime errors that result fromintegrating the three programming models that compilers cannot detect.For the first time,this paper presents a classification of operational errors that can result from the integration of the three models.This paper also proposes a parallel hybrid testing technique for detecting runtime errors in systems built in the C++programming language that uses the triple programming models MPI,OpenMP,and OpenACC.This hybrid technology combines static technology and dynamic technology,given that some errors can be detected using static techniques,whereas others can be detected using dynamic technology.The hybrid technique can detect more errors because it combines two distinct technologies.The proposed static technology detects a wide range of error types in less time,whereas a portion of the potential errors that may or may not occur depending on the 4502 CMC,2023,vol.74,no.2 operating environment are left to the dynamic technology,which completes the validation.
基金This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB0200501, tile National Natural Science Foundation of China under Grant Nos. 61332009 and 61521092, the Open Project Program of State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2016A04 and tile Beijing Municipal Science and Technology Commission under Grant No. Z15010101009, the Open Project Program of State Key Laboratory of Computer Architecture under Grant No. CARCH201503, China Scholarship Council, and Beijing Advanced hmovation Center for hnaging Technology.
文摘With the coming of exascale supercomputing era, power efficiency has become the most important obstacle to build an exascale system. Dataflow architecture has native advantage in achieving high power efficiency for scientific applications. However, the state-of-the-art dataflow architectures fail to exploit high parallelism for loop processing. To address this issue, we propose a pipelining loop optimization method (PLO), which makes iterations in loops flow in the processing element (PE) array of dataflow accelerator. This method consists of two techniques, architecture-assisted hardware iteration and instruction-assisted software iteration. In hardware iteration execution model, an on-chip loop controller is designed to generate loop indexes, reducing the complexity of computing kernel and laying a good f(mndation for pipelining execution. In software iteration execution model, additional loop instructions are presented to solve the iteration dependency problem. Via these two techniques, the average number of instructions ready to execute per cycle is increased to keep floating-point unit busy. Simulation results show that our proposed method outperforms static and dynamic loop execution model in floating-point efficiency by 2.45x and 1.1x on average, respectively, while the hardware cost of these two techniques is acceptable.