Funding: This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB0200501, the National Natural Science Foundation of China under Grant Nos. 61332009 and 61521092, the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2016A04, the Beijing Municipal Science and Technology Commission under Grant No. Z15010101009, the Open Project Program of the State Key Laboratory of Computer Architecture under Grant No. CARCH201503, the China Scholarship Council, and the Beijing Advanced Innovation Center for Imaging Technology.
Abstract: With the arrival of the exascale supercomputing era, power efficiency has become the most important obstacle to building an exascale system. Dataflow architectures have a native advantage in achieving high power efficiency for scientific applications. However, state-of-the-art dataflow architectures fail to exploit high parallelism in loop processing. To address this issue, we propose a pipelining loop optimization method (PLO), which makes the iterations of a loop flow through the processing element (PE) array of a dataflow accelerator. The method consists of two techniques: architecture-assisted hardware iteration and instruction-assisted software iteration. In the hardware iteration execution model, an on-chip loop controller generates loop indexes, reducing the complexity of the computing kernel and laying a good foundation for pipelined execution. In the software iteration execution model, additional loop instructions are introduced to solve the iteration dependency problem. With these two techniques, the average number of instructions ready to execute per cycle increases, keeping the floating-point units busy. Simulation results show that the proposed method outperforms the static and dynamic loop execution models in floating-point efficiency by 2.45x and 1.1x on average, respectively, while the hardware cost of the two techniques is acceptable.
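As a rough illustration of why letting iterations flow through the PE array raises floating-point utilization, the sketch below compares cycle counts for a loop body mapped onto a chain of PEs. It is a textbook pipeline model, not the paper's simulator: the function names, the one-cycle-per-stage timing, and the absence of iteration dependencies are all assumptions made for illustration.

```python
def serial_cycles(iterations: int, stages: int) -> int:
    """Cycles when each iteration must leave the PE chain before the next one enters."""
    return iterations * stages


def pipelined_cycles(iterations: int, stages: int) -> int:
    """Cycles when a new iteration enters the PE chain every cycle (assumes no dependencies)."""
    if iterations == 0:
        return 0
    # The first iteration takes `stages` cycles; each later one finishes one cycle after the previous.
    return stages + iterations - 1


def utilization(iterations: int, stages: int, cycles: int) -> float:
    """Fraction of available PE-cycles that perform useful work."""
    if cycles == 0:
        return 0.0
    useful = iterations * stages      # every iteration visits every stage once
    available = stages * cycles       # every PE is available on every cycle
    return useful / available


if __name__ == "__main__":
    n, s = 100, 8  # hypothetical loop trip count and PE chain length
    for name, cycles in (("serial", serial_cycles(n, s)), ("pipelined", pipelined_cycles(n, s))):
        print(f"{name}: {cycles} cycles, utilization {utilization(n, s, cycles):.3f}")
```

With the hypothetical values of 100 iterations and 8 stages, the serial model needs 800 cycles (utilization 0.125) and the pipelined model 107 cycles (utilization about 0.935). Real gains depend on iteration dependencies and memory behavior, which this model ignores.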
Funding: This work was supported by the National Natural Science Foundation of China under Grant Nos. 61502321, 61472260, and 61402302, the Beijing Natural Science Foundation under Grant No. 4143060, the Overseas Visiting Scholar Program of Beijing under Grant No. 067135300100, the State Key Laboratory of Computer Architecture of China under Grant No. CARCH201503, and the Beijing Innovative Teams and Teacher Career Development Program under Grant No. IDHT20150507.
Abstract: Emerging byte-addressable non-volatile memory technologies, such as phase change memory (PCM) and spin-transfer torque RAM (STT-RAM), offer both the byte-addressability of memory and the durability of storage, making it feasible to build single-level store systems. To keep persistent data structures consistent across power failures and system crashes, cache lines must be flushed to persistent memory frequently, which incurs non-trivial synchronization overhead. To mitigate this issue, we propose two techniques. First, we use non-volatile STT-RAM as on-chip scratchpad memory to store recovery information, eliminating synchronization cost in the logging phase by avoiding off-chip logging operations. Second, we present an adaptive synchronization policy that selects caching modes according to data access patterns, eliminating unnecessary synchronization cost in the checkpoint phase. Evaluation results indicate that the two techniques improve overall performance by 2.15x to 2.39x compared with conventional transactional persistent memory.
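To make the logging-phase saving concrete, the toy model below counts off-chip cache-line flushes for an undo-logged transaction, with the recovery log kept either off chip or in an on-chip non-volatile scratchpad. It is a sketch under stated assumptions, not the paper's design: the class and function names are hypothetical, each address stands for one cache line, and the adaptive synchronization policy is not modeled.

```python
_MISSING = object()  # marks "address had no previous value" in the undo log


class ToyPersistentTx:
    """Toy undo-logging transaction that only counts off-chip flushes (illustrative)."""

    def __init__(self, log_on_chip: bool):
        self.log_on_chip = log_on_chip  # True: log entries stay in an on-chip NV scratchpad
        self.memory = {}                # stand-in for off-chip persistent memory
        self.undo_log = []              # (address, old value) pairs
        self.flushes = 0                # off-chip cache-line flushes issued

    def write(self, addr, value):
        # Undo logging: the old value must be durable before the in-place update.
        self.undo_log.append((addr, self.memory.get(addr, _MISSING)))
        if not self.log_on_chip:
            self.flushes += 1           # assumption: one flush per off-chip log entry
        self.memory[addr] = value

    def commit(self):
        # Flush each distinct dirty line once, then discard the log.
        self.flushes += len({addr for addr, _ in self.undo_log})
        self.undo_log.clear()

    def abort(self):
        # Roll back in reverse order using the logged old values.
        for addr, old in reversed(self.undo_log):
            if old is _MISSING:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.undo_log.clear()


def flushes_for(log_on_chip: bool) -> int:
    """Run one hypothetical transaction: four writes touching three distinct lines."""
    tx = ToyPersistentTx(log_on_chip)
    for addr, value in (("a", 1), ("b", 2), ("a", 3), ("c", 4)):
        tx.write(addr, value)
    tx.commit()
    return tx.flushes


if __name__ == "__main__":
    print("off-chip log:", flushes_for(False))
    print("on-chip log:", flushes_for(True))
```

In this hypothetical transaction the off-chip log costs 7 flushes (4 log entries plus 3 dirty lines), while the on-chip log costs only the 3 data flushes at commit. The numbers show only where the saving comes from and say nothing about the reported speedups.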