Funding: Project supported by the National Natural Science Foundation of China (No. 61170083) and the Specialized Research Fund for the Doctoral Program of Higher Education, China (No. 20114307110001).
Abstract: As we approach the exascale era in supercomputing, designing a balanced computer system with powerful computing ability and low power requirements has become increasingly important. The graphics processing unit (GPU) is an accelerator widely used in most recent supercomputers. It employs a large number of threads to hide long latencies with high energy efficiency. In contrast to their powerful computing ability, GPUs have only a few megabytes of fast on-chip memory per streaming multiprocessor (SM). The GPU cache is inefficient due to a mismatch between the throughput-oriented execution model and the cache hierarchy design. At the same time, current GPUs fail to handle burst-mode long-access latency because of their poor warp scheduling methods. Thus, the benefits of the GPU's high computing ability are reduced dramatically by poor cache management and warp scheduling, which limit system performance and energy efficiency. In this paper, we put forward a coordinated warp scheduling and locality-protected (CWLP) cache allocation scheme to make full use of data locality and hide latency. We first present a locality-protected cache allocation method based on the instruction program counter (LPC) to improve cache performance. Specifically, we use a PC-based locality detector to collect the reuse information of each cache line and employ a prioritised cache allocation unit (PCAU) that combines the data reuse information with time-stamp information to evict the lines with the lowest reuse probability. Moreover, the locality information is used by the warp scheduler to create an intelligent warp reordering scheme that captures locality and hides latency. Simulation results show that CWLP provides a speedup of up to 19.8% and an average improvement of 8.8% over the baseline methods.
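The abstract only outlines the mechanism, so the following is a minimal C++ sketch, under assumptions of our own, of how a PC-indexed reuse table and a reuse-plus-time-stamp eviction priority might fit together. The names (LocalityDetector, CacheLine, pick_victim) and field layouts are illustrative, not the paper's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of a PC-based locality detector and a prioritised
// eviction decision that combines per-PC reuse counts with time stamps.

struct CacheLine {
    uint64_t tag = 0;
    uint64_t pc = 0;          // PC of the load that allocated this line
    uint64_t last_access = 0; // time stamp of the most recent access
    bool valid = false;
};

class LocalityDetector {
    // Per-PC reuse statistics: how often lines allocated by this PC are reused.
    std::unordered_map<uint64_t, uint32_t> pc_reuse_;
public:
    void record_hit(uint64_t pc) { ++pc_reuse_[pc]; }
    uint32_t predicted_reuse(uint64_t pc) const {
        auto it = pc_reuse_.find(pc);
        return it == pc_reuse_.end() ? 0 : it->second;
    }
};

// Choose a victim way in a set: prefer lines whose allocating PC shows the
// least reuse; break ties with the oldest time stamp (LRU-like).
std::size_t pick_victim(const std::vector<CacheLine>& set,
                        const LocalityDetector& det) {
    std::size_t victim = 0;
    uint64_t best_key = std::numeric_limits<uint64_t>::max();
    for (std::size_t i = 0; i < set.size(); ++i) {
        if (!set[i].valid) return i;  // free way: allocate here
        // Predicted reuse dominates the key; the time stamp breaks ties.
        uint64_t key = (uint64_t(det.predicted_reuse(set[i].pc)) << 32)
                       | (set[i].last_access & 0xFFFFFFFFu);
        if (key < best_key) { best_key = key; victim = i; }
    }
    return victim;
}
```

The design choice sketched here is that predicted reuse dominates the eviction key while the time stamp only breaks ties, mirroring the coordination of reuse and time-stamp information described in the abstract.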
Funding: Supported by the Major Special Projects on National Medium and Long-term Science and Technology Development Planning.
Abstract: SRAM (static random access memory)-based FPGAs (field programmable gate arrays), owing to their large capacity, high performance, and dynamic reconfiguration, have become an attractive platform for SoPC (system on programmable chip) development. However, because the configuration memory and logic memory of an SRAM-based FPGA are highly susceptible to SEUs (single-event upsets) in deep space, it is a challenge to design and implement a highly reliable FPGA-based system for spacecraft, and no practical architecture has been proposed. In this paper, a new architecture for a reliable and reconfigurable FPGA-based computer in a highly critical GNC (guidance, navigation and control) system is proposed. To mitigate the effect of SEUs on the system, multi-layer reconfiguration and multi-layer TMR (triple module redundancy) techniques are proposed, with a reliable reconfigurable real-time operating system (Space OS) managing the system-level fault tolerance of the computer. The proposed architecture has been implemented with a COTS (commercial off-the-shelf) FPGA and was first applied to the GNC system of a circumlunar return and reentry flight vehicle. The in-orbit results show that the proposed architecture meets the requirements of high reliability and high availability, and can provide the varying functionality and runtime flexibility expected of an FPGA-based GNC computer in deep space.
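To make the TMR idea concrete, below is a minimal C++ sketch of a bitwise majority vote over three redundant copies of a word, with a disagreement flag that could trigger repair or reconfiguration of the faulty module. This is only a conceptual model with hypothetical names, not the paper's FPGA implementation, which would normally be realised in hardware description logic.

```cpp
#include <cstdint>
#include <cstdio>

// Conceptual sketch of a triple-module-redundancy (TMR) vote over three
// redundant copies of a word: a single upset in one copy is masked, and a
// disagreement is reported so the affected module could be reconfigured.

struct VoteResult {
    uint32_t value;     // bitwise majority of the three copies
    bool disagreement;  // true if any copy differed from the others
};

VoteResult tmr_vote(uint32_t a, uint32_t b, uint32_t c) {
    // Bitwise majority: a bit is 1 if it is 1 in at least two copies.
    uint32_t majority = (a & b) | (a & c) | (b & c);
    bool mismatch = (a != b) || (b != c);
    return {majority, mismatch};
}

int main() {
    // Copy 'b' has a single-event upset in bit 3; the vote masks it.
    uint32_t a = 0x12345678, b = 0x12345670, c = 0x12345678;
    VoteResult r = tmr_vote(a, b, c);
    std::printf("voted=0x%08X, disagreement=%d\n", r.value, r.disagreement);
    return 0;
}
```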
Funding: Supported by the National High-tech R&D Program of China (863 Program) (Grant No. 2012AA01A3-02), the National Natural Science Foundation of China (Grant Nos. 61133004, 61361126011, and others), the State Key Laboratory of Software Development Environment (Grant No. SKLSDE-2013ZX-22), and the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing.
Abstract: Dynamic random access memory (DRAM) is facing the challenge of technology scaling. The decreasing feature size makes it harder to build DRAM cells that keep the current data-holding time. When DRAM cells cannot hold data for a long time, DRAM chips need more frequent refresh operations. Therefore, in the near future, the time and energy cost of DRAM refreshing will no longer be trivial. In this paper, we propose DRAM Error Correction Pointer (ECP), an error-correction-style framework, to reduce DRAM refreshes without data loss. We exploit the non-uniformity of DRAM cells with respect to data retention time. Compared with conventional refreshing mechanisms, which refresh DRAM chips at the rate required by the leakiest cells, we refresh the chips far less often and treat the cells that are not refreshed in time as faulty elements. We use the ECP structure as a fault-tolerance element. By recording the data that are supposed to be written into the leaky cells in our DRAM-ECP structures, DRAM-ECP can significantly decrease the refresh frequency. When these data are read out, DRAM-ECP retrieves the data stored in the ECPs and writes them back to the corresponding positions in the data row. Our experiments show that DRAM-ECP can reduce refresh operations by over 70% compared with the current refreshing mechanism and also achieves significant energy savings.
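As a rough model of the ECP mechanism described above, the C++ sketch below keeps, for each row, a small table of pointers to leaky cells together with the values those cells should hold, and patches a read with the stored values. The byte granularity, the entry budget, and all names are assumptions made for illustration, not details taken from the paper.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Rough model of DRAM-ECP: per-row error correction pointers remember the
// intended contents of leaky cells so the row can be refreshed less often.

struct EcpEntry {
    uint32_t column;  // position of the leaky cell within the row
    uint8_t  data;    // value that cell is supposed to hold
};

class EcpRow {
    std::vector<uint8_t>  cells_;    // raw (possibly decayed) row contents
    std::vector<EcpEntry> pointers_; // ECP entries for known leaky cells
    std::size_t max_entries_;
public:
    EcpRow(std::size_t row_bytes, std::size_t max_entries)
        : cells_(row_bytes, 0), max_entries_(max_entries) {}

    // A write that targets a known-leaky column is also captured in an ECP
    // entry; returns false if the row is out of entries and must fall back
    // to normal refreshing.
    bool write(uint32_t column, uint8_t value, bool column_is_leaky) {
        cells_[column] = value;
        if (!column_is_leaky) return true;
        for (auto& e : pointers_) {
            if (e.column == column) { e.data = value; return true; }
        }
        if (pointers_.size() >= max_entries_) return false;
        pointers_.push_back({column, value});
        return true;
    }

    // On a read, stored ECP data overrides whatever the decayed cells return.
    std::vector<uint8_t> read() const {
        std::vector<uint8_t> out = cells_;
        for (const auto& e : pointers_) out[e.column] = e.data;
        return out;
    }
};
```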