Stream Register File (SRF) is a large on-chip memory of the stream processor and its efficient management is essential for good performance. Current stream programming languages expose the management of SRF to the p...Stream Register File (SRF) is a large on-chip memory of the stream processor and its efficient management is essential for good performance. Current stream programming languages expose the management of SRF to the programmer, incurring heavy burden on the programmer and bringing difficulties to inheriting the legacy codes. SF95 is the language developed for FT64 which is the first 64-bit stream processor designed for scientific applications. SF95 conceals SRF from the programmer and leaves the management of SRF to its compiler. In this paper, we present a compiler approach named SRF Coloring to manage SRF automatically. The novelties of this paper are: first, it is the first time to use the graph coloring-based algorithm for the SRF management; second, an algorithm framework for SRF Coloring that is well suited to the FT64 architecture is proposed this framework is based on a well-understood graph coloring algorithm for register allocation, together with some modifications to deal with the unusual aspects of SRF problem; third, the SRF Coloring algorithm is implemented in SF95Compiler, a compiler designed for FT64 and SF95. The experimental results show that our approach represents a practical and promising solution to SRF allocation.展开更多
In modern microprocessors, the multi-port register file is one of the key modules which provides fast and multiple data access for instructions. As the number of access ports in register files increases, stability bec...In modern microprocessors, the multi-port register file is one of the key modules which provides fast and multiple data access for instructions. As the number of access ports in register files increases, stability becomes a key issue due to the voltage fluctuation on bit lines. We propose to apply an isolated inverter to address the voltage fluctuation. To assess the register stability, we derive a closed-form expression of static noise margin (SNM) for our register file. The proposed SNM model can be used as a guideline to predict the impact of several register parameters on the stability and optimize register file designs. To validate the proposed SNM model, we fabricated a test chip of two-write-four-read (2W4R) 1024 bits register file in a TSMC 65 nm low-power CMOS technology. The experimental result shows that the stability of our register file cells with an isolated inverter improve the conventional cells by approximately 2.4 times. Also, the supply voltage causes a fluctuation of SNM of about 65%, while temperature and transistor mismatch cause a fluctuation of SNM of about 20%.展开更多
High-energy particles in the space can easily cause soft error in register file(RF).As a critical structure in a processor,RF often stores data for long periods of time and is read frequently,resulting in a higher pro...High-energy particles in the space can easily cause soft error in register file(RF).As a critical structure in a processor,RF often stores data for long periods of time and is read frequently,resulting in a higher probability of spreading corrupted data to other parts of the processor.The triple modular redundancy(TMR)is a common and effective fault tolerance method that enables multi-bit error correction.Designing full TMR for all the registers could cause excessive area and power overheads.However,some registers in RF have less impact on processor reliability.Therefore,there is no need to design TMR for them.This paper designs an efficient strategy which can rate the registers in RF based on their vulnerability.Based on the proposed strategy,a new RF fault tolerance method named Partial-TMR formulates in this paper,which selectively protects more vulnerable registers against multi-bit error,and improves fault tolerance efficiency.For integer RF,Partial-TMR improves its soft error correction capability by 24.5%relative to the baseline system and 3%relative to ParShield,while for floating-point RF,the improvement comes to 5.17%and 0.58%respectively.The soft error correction capability of Partial-TMR is slightly lower than that of full TMR by 1%to 3%,but Partial-TMR significantly cuts the area and power overheads.Compared with full TMR,Partial-TMR decreases the area and power overheads by 71.6%and 64.9%,respectively.It also has little impact on the performance.Partial-TMR is a more cost-effective fault tolerance method compared with ParShield and full TMR.展开更多
A register file(RF) with 32×32 capacity and 4-read 2-write(4R2W) ports is presented and analyzed in detail.A new output structure using a MUX and a latch is proposed.It eliminates any dynamic or analog circui...A register file(RF) with 32×32 capacity and 4-read 2-write(4R2W) ports is presented and analyzed in detail.A new output structure using a MUX and a latch is proposed.It eliminates any dynamic or analog circuit in the read path,and thus it can improve robustness and reduce power at the same time.We also simplify the timing sequence due to the output scheme.The simplified timing circuit not only cuts down the power but also improves the robustness.In addition,less power is achieved when successive read of"0"or"1"is performed.The RF has been fabricated in TSMC 65 nm technology,and the chip test demonstrates that it can operate at 0.8 GHz,consuming 7.2 mW at 1.2 V.展开更多
a low-power register file is designed by using a low-swing strategy and modified NAND address decoders. The proposed low-swing strategy is based on the feedback scheme and uses dynamic logic to reduce the active feedb...a low-power register file is designed by using a low-swing strategy and modified NAND address decoders. The proposed low-swing strategy is based on the feedback scheme and uses dynamic logic to reduce the active feedback power.This method contains two parts:WRITE and READ strategy.In the WRITE low-swing scheme,the modified memory cell is used to support low-swing WRITE.The modified NAND decoder not only dissipates less power,but also enables a great deal of area reduction.Compared with the conventional single-ended register file,the low-swing strategy saves 34.5%and 51.15%bit-line power in WRITE and READ separately.The post simulation results indicate a 39.4%power improvement when the twelve ports are all busy.展开更多
The key to high performance for GPU architecture lies in its massive threading capability to drive a large number of cores and enable execution overlapping among threads. However, in reality, the number of threads tha...The key to high performance for GPU architecture lies in its massive threading capability to drive a large number of cores and enable execution overlapping among threads. However, in reality, the number of threads that can simultaneously execute is often limited by the size of the register file on GPUs. The traditional SRAM-based register file takes up so large amount of chip area that it cannot scale to meet the increasing demand of GPU applications. Racetrack memory (RM) is a promising technology for designing large capacity register file on GPUs due to its high data storage density. However, without careful deployment of RM-based register file, the lengthy shift operations of RM may hurt the performance. In this paper, we explore RM for designing high-performance register file for GPU architecture. High storage density RM helps to improve the thread level parallelism (TLP), but if the bits of the registers are not aligned to the ports, shift operations are required to move the bits to the access ports before they are accessed, and thus the read/write operations are delayed. We develop an optimization framework for RM-based register file on GPUs, which employs three different optimization techniques at the application, compilation, and architecture level, respectively. More clearly, we optimize the TLP at the application level, design a register mapping algorithm at the compilation level, and design a preshifting mechanism at the architecture level. Collectively, these optimizations help to determine the TLP without causing cache and register file resource contention and reduce the shift operation overhead. Experimental results using a variety of representative workloads demonstrate that our optimization framework achieves up to 29% (21% on average) performance improvement.展开更多
5G baseband signal processing places greater real-time and reliability requirements on hardware.Based on the architecture of the MaPU,a reconfigurable computing architecture is proposed according to the characteristic...5G baseband signal processing places greater real-time and reliability requirements on hardware.Based on the architecture of the MaPU,a reconfigurable computing architecture is proposed according to the characteristics of the 5G baseband signal processing.A dedicated instruction set for 5G baseband signal processing is proposed.The corresponding functional units are designed for reuse of hardware resources.A redirected register file is proposed to address latency and power consumption issues in internetwork.A two-dimensional code compression scheme is proposed for cases in which the use ratio of instruction memory is low.The access mode of the data memory is extended,the performance is improved and the power consumption is reduced.The throughput of 5G baseband processing algorithm is one to two orders of magnitude higher than that of the TMS320C6670 with less power consumption.The silicon area evaluated by layout is 5.8 mm2,which is 1/6 of the MaPU’s.The average power consumption is 0.7 W,which is 1/5 of the MaPU’s.展开更多
Matrix inversion is a critical part in communication, signal processing and electromagnetic system. A flexible and scalable very long instruction word (VLIW) processor with clustered architecture is proposed for mat...Matrix inversion is a critical part in communication, signal processing and electromagnetic system. A flexible and scalable very long instruction word (VLIW) processor with clustered architecture is proposed for matrix inversion. A global register file (RF) is used to connect al the clusters. Two nearby clusters share a local register file. The instruction sets are also designed for the VLIW processor. Experimental results show that the proposed VLIW architecture takes only 45 latency to invert a 4 × 4 matrix when running at 150 MHz. The proposed design is roughly five times faster than the DSP solution in processing speed.展开更多
The cost of the central register file and the size of the program code limit the scalability of very long instruction word(VLIW) processors with increasing numbers of functional units.This paper presents the archite...The cost of the central register file and the size of the program code limit the scalability of very long instruction word(VLIW) processors with increasing numbers of functional units.This paper presents the architectural design of a six-way VLIW digital signal processor(DSP) with clustered register files.The architecture uses a variable length instruction set and supports dynamic instruction dispatching.The one-level memory system architecture of the processor includes 16-KB instruction and data caches and 16-KB instruction and data on-chip RAM.A compiler based on the Open64 was developed for the system.Evaluations show that the processor is suitable for high performance applications with a high code density and small program code size.展开更多
基金Supported by the National Natural Science Foundation of China under Grant Nos.60621003 and 60633050.
文摘Stream Register File (SRF) is a large on-chip memory of the stream processor and its efficient management is essential for good performance. Current stream programming languages expose the management of SRF to the programmer, incurring heavy burden on the programmer and bringing difficulties to inheriting the legacy codes. SF95 is the language developed for FT64 which is the first 64-bit stream processor designed for scientific applications. SF95 conceals SRF from the programmer and leaves the management of SRF to its compiler. In this paper, we present a compiler approach named SRF Coloring to manage SRF automatically. The novelties of this paper are: first, it is the first time to use the graph coloring-based algorithm for the SRF management; second, an algorithm framework for SRF Coloring that is well suited to the FT64 architecture is proposed this framework is based on a well-understood graph coloring algorithm for register allocation, together with some modifications to deal with the unusual aspects of SRF problem; third, the SRF Coloring algorithm is implemented in SF95Compiler, a compiler designed for FT64 and SF95. The experimental results show that our approach represents a practical and promising solution to SRF allocation.
基金supported by the National Natural Science Foundation of China(Nos,61404076,61474068)the Zhejiang Provincial Natural Science Foundation of China(No.LQ14F040001)+3 种基金the S&T Plan of Zhejiang Provincial Science and Technology Department(No.2015C31010)the China Spark Program(No.2015GA701053)the Ningbo Natural Science Foundation(Nos.2014A610148,2015A610107)the K.C.Wong Magna Fund in Ningbo University,China
文摘In modern microprocessors, the multi-port register file is one of the key modules which provides fast and multiple data access for instructions. As the number of access ports in register files increases, stability becomes a key issue due to the voltage fluctuation on bit lines. We propose to apply an isolated inverter to address the voltage fluctuation. To assess the register stability, we derive a closed-form expression of static noise margin (SNM) for our register file. The proposed SNM model can be used as a guideline to predict the impact of several register parameters on the stability and optimize register file designs. To validate the proposed SNM model, we fabricated a test chip of two-write-four-read (2W4R) 1024 bits register file in a TSMC 65 nm low-power CMOS technology. The experimental result shows that the stability of our register file cells with an isolated inverter improve the conventional cells by approximately 2.4 times. Also, the supply voltage causes a fluctuation of SNM of about 65%, while temperature and transistor mismatch cause a fluctuation of SNM of about 20%.
文摘High-energy particles in the space can easily cause soft error in register file(RF).As a critical structure in a processor,RF often stores data for long periods of time and is read frequently,resulting in a higher probability of spreading corrupted data to other parts of the processor.The triple modular redundancy(TMR)is a common and effective fault tolerance method that enables multi-bit error correction.Designing full TMR for all the registers could cause excessive area and power overheads.However,some registers in RF have less impact on processor reliability.Therefore,there is no need to design TMR for them.This paper designs an efficient strategy which can rate the registers in RF based on their vulnerability.Based on the proposed strategy,a new RF fault tolerance method named Partial-TMR formulates in this paper,which selectively protects more vulnerable registers against multi-bit error,and improves fault tolerance efficiency.For integer RF,Partial-TMR improves its soft error correction capability by 24.5%relative to the baseline system and 3%relative to ParShield,while for floating-point RF,the improvement comes to 5.17%and 0.58%respectively.The soft error correction capability of Partial-TMR is slightly lower than that of full TMR by 1%to 3%,but Partial-TMR significantly cuts the area and power overheads.Compared with full TMR,Partial-TMR decreases the area and power overheads by 71.6%and 64.9%,respectively.It also has little impact on the performance.Partial-TMR is a more cost-effective fault tolerance method compared with ParShield and full TMR.
基金Project supported by the National Significant Science and Technology Projects(No.01-Special-2010ZX01030-001-001-03)
文摘A register file(RF) with 32×32 capacity and 4-read 2-write(4R2W) ports is presented and analyzed in detail.A new output structure using a MUX and a latch is proposed.It eliminates any dynamic or analog circuit in the read path,and thus it can improve robustness and reduce power at the same time.We also simplify the timing sequence due to the output scheme.The simplified timing circuit not only cuts down the power but also improves the robustness.In addition,less power is achieved when successive read of"0"or"1"is performed.The RF has been fabricated in TSMC 65 nm technology,and the chip test demonstrates that it can operate at 0.8 GHz,consuming 7.2 mW at 1.2 V.
基金Proiect supported by the National Science and Technology Maior Proiects of China(No.2009ZX01034-001-002-005)
文摘a low-power register file is designed by using a low-swing strategy and modified NAND address decoders. The proposed low-swing strategy is based on the feedback scheme and uses dynamic logic to reduce the active feedback power.This method contains two parts:WRITE and READ strategy.In the WRITE low-swing scheme,the modified memory cell is used to support low-swing WRITE.The modified NAND decoder not only dissipates less power,but also enables a great deal of area reduction.Compared with the conventional single-ended register file,the low-swing strategy saves 34.5%and 51.15%bit-line power in WRITE and READ separately.The post simulation results indicate a 39.4%power improvement when the twelve ports are all busy.
基金This work was supported by the National Natural Science Foundation of China under Grant No. 61300005.
文摘The key to high performance for GPU architecture lies in its massive threading capability to drive a large number of cores and enable execution overlapping among threads. However, in reality, the number of threads that can simultaneously execute is often limited by the size of the register file on GPUs. The traditional SRAM-based register file takes up so large amount of chip area that it cannot scale to meet the increasing demand of GPU applications. Racetrack memory (RM) is a promising technology for designing large capacity register file on GPUs due to its high data storage density. However, without careful deployment of RM-based register file, the lengthy shift operations of RM may hurt the performance. In this paper, we explore RM for designing high-performance register file for GPU architecture. High storage density RM helps to improve the thread level parallelism (TLP), but if the bits of the registers are not aligned to the ports, shift operations are required to move the bits to the access ports before they are accessed, and thus the read/write operations are delayed. We develop an optimization framework for RM-based register file on GPUs, which employs three different optimization techniques at the application, compilation, and architecture level, respectively. More clearly, we optimize the TLP at the application level, design a register mapping algorithm at the compilation level, and design a preshifting mechanism at the architecture level. Collectively, these optimizations help to determine the TLP without causing cache and register file resource contention and reduce the shift operation overhead. Experimental results using a variety of representative workloads demonstrate that our optimization framework achieves up to 29% (21% on average) performance improvement.
基金Project(XDA-06010402)supported by the Strategic Priority Research Program of Chinese Academy of SciencesProject(Y5S7061G51)supported by the Youth Innovation Promotion Association of Chinese Academy of Sciences
文摘5G baseband signal processing places greater real-time and reliability requirements on hardware.Based on the architecture of the MaPU,a reconfigurable computing architecture is proposed according to the characteristics of the 5G baseband signal processing.A dedicated instruction set for 5G baseband signal processing is proposed.The corresponding functional units are designed for reuse of hardware resources.A redirected register file is proposed to address latency and power consumption issues in internetwork.A two-dimensional code compression scheme is proposed for cases in which the use ratio of instruction memory is low.The access mode of the data memory is extended,the performance is improved and the power consumption is reduced.The throughput of 5G baseband processing algorithm is one to two orders of magnitude higher than that of the TMS320C6670 with less power consumption.The silicon area evaluated by layout is 5.8 mm2,which is 1/6 of the MaPU’s.The average power consumption is 0.7 W,which is 1/5 of the MaPU’s.
基金supported by the National Natural Science Foundation of China(6110015561227004+4 种基金613720716137213161201289)the Fundamental Research Funds of the Central Universities of China(K5051302096JB140207)
文摘Matrix inversion is a critical part in communication, signal processing and electromagnetic system. A flexible and scalable very long instruction word (VLIW) processor with clustered architecture is proposed for matrix inversion. A global register file (RF) is used to connect al the clusters. Two nearby clusters share a local register file. The instruction sets are also designed for the VLIW processor. Experimental results show that the proposed VLIW architecture takes only 45 latency to invert a 4 × 4 matrix when running at 150 MHz. The proposed design is roughly five times faster than the DSP solution in processing speed.
基金Supported by the National Natural Science Foundation of China (No.60236020)the Specialized Research Fund for the Doctoral Program of Higher Education of MOE,China (No.20050003083)
文摘The cost of the central register file and the size of the program code limit the scalability of very long instruction word(VLIW) processors with increasing numbers of functional units.This paper presents the architectural design of a six-way VLIW digital signal processor(DSP) with clustered register files.The architecture uses a variable length instruction set and supports dynamic instruction dispatching.The one-level memory system architecture of the processor includes 16-KB instruction and data caches and 16-KB instruction and data on-chip RAM.A compiler based on the Open64 was developed for the system.Evaluations show that the processor is suitable for high performance applications with a high code density and small program code size.