A Taylor series expansion(TSE) based design for minimum mean-square error(MMSE) and QR decomposition(QRD) of multi-input and multi-output(MIMO) systems is proposed based on application specific instruction set process...A Taylor series expansion(TSE) based design for minimum mean-square error(MMSE) and QR decomposition(QRD) of multi-input and multi-output(MIMO) systems is proposed based on application specific instruction set processor(ASIP), which uses TSE algorithm instead of resource-consuming reciprocal and reciprocal square root(RSR) operations.The aim is to give a high performance implementation for MMSE and QRD in one programmable platform simultaneously.Furthermore, instruction set architecture(ISA) and the allocation of data paths in single instruction multiple data-very long instruction word(SIMD-VLIW) architecture are provided, offering more data parallelism and instruction parallelism for different dimension matrices and operation types.Meanwhile, multiple level numerical precision can be achieved with flexible table size and expansion order in TSE ISA.The ASIP has been implemented to a 28 nm CMOS process and frequency reaches 800 MHz.Experimental results show that the proposed design provides perfect numerical precision within the fixed bit-width of the ASIP, higher matrix processing rate better than the requirements of 5G system and more rate-area efficiency comparable with ASIC implementations.展开更多
Aims to provide the block architecture of CoStar3400 DSP that is a high performance, low power and scalable VLIW DSP core, it efficiently deployed a variable-length execution set (VLES) execution model which utilizes ...Aims to provide the block architecture of CoStar3400 DSP that is a high performance, low power and scalable VLIW DSP core, it efficiently deployed a variable-length execution set (VLES) execution model which utilizes the maximum parallelism by allowing multiple address generations and data arithmetic logic units to execute multiple instructions in a single clock cycle. The scalability was provided mainly in using more or less number of functional units according to the intended application. Low power support was added by careful architectural design techniques such as fine-grain clock gating and activation of only the required number of control signals at each stage of the pipeline. The said features of the core make it a suitable candidate for many SoC configurations, especially for compute intensive applications such as wire-line and wireless communications, including infrastructure and subscriber communications. The embedded system designers can efficiently use the scalability and VLIW features of the core by scaling the number of execution units according to specific needs of the application to effectively reduce the power consumption, chip area and time to market the intended final product.展开更多
Matrix inversion is a critical part in communication, signal processing and electromagnetic system. A flexible and scalable very long instruction word (VLIW) processor with clustered architecture is proposed for mat...Matrix inversion is a critical part in communication, signal processing and electromagnetic system. A flexible and scalable very long instruction word (VLIW) processor with clustered architecture is proposed for matrix inversion. A global register file (RF) is used to connect al the clusters. Two nearby clusters share a local register file. The instruction sets are also designed for the VLIW processor. Experimental results show that the proposed VLIW architecture takes only 45 latency to invert a 4 × 4 matrix when running at 150 MHz. The proposed design is roughly five times faster than the DSP solution in processing speed.展开更多
The cost of the central register file and the size of the program code limit the scalability of very long instruction word(VLIW) processors with increasing numbers of functional units.This paper presents the archite...The cost of the central register file and the size of the program code limit the scalability of very long instruction word(VLIW) processors with increasing numbers of functional units.This paper presents the architectural design of a six-way VLIW digital signal processor(DSP) with clustered register files.The architecture uses a variable length instruction set and supports dynamic instruction dispatching.The one-level memory system architecture of the processor includes 16-KB instruction and data caches and 16-KB instruction and data on-chip RAM.A compiler based on the Open64 was developed for the system.Evaluations show that the processor is suitable for high performance applications with a high code density and small program code size.展开更多
Very Long Instruction Word (VLIW) architectures are commonly used in application-specific domains due to their parallelism and low-power characteristics. Recently, parameterization of such architectures allows for r...Very Long Instruction Word (VLIW) architectures are commonly used in application-specific domains due to their parallelism and low-power characteristics. Recently, parameterization of such architectures allows for runtime adaptation of the issue-width to match the inherent Instruction Level Parallelism (ILP) of an application. One implementation of such an approach is that the event of the issue-width switching dynamically triggers the reconfiguration of the data cache at runtime. In this paper, the relationship between cache resizing and issue-width is well investigated. We have observed that the requirement of the cache does not always correlate with the issue- width of the VLIW processor. To further coordinate the cache resizing with the changing issue-width, we present a novel feedback mechanism to "block" the low yields of cache resizing when the issue-width changes. In this manner, our feedback cache mechanism has a coordinated effort with the issue-width changes, which leads to a noticeable improvement of the cache performance. The experiments show that there is 10% energy savings as well as a 2.3% cache misses decline on average achieved, compared with the cache without the feedback mechanism. Therefore, the feedback mechanism is proven to have the capability to ensure more benefits are achieved from the dynamic and frequent reconfiguration.展开更多
The composite field multiplication is an important and complex module in symmetric cipher algorithms, and its realization performance directly restricts the processing speed of symmetric cipher algorithms. Based on th...The composite field multiplication is an important and complex module in symmetric cipher algorithms, and its realization performance directly restricts the processing speed of symmetric cipher algorithms. Based on the characteristics of composite field multiplication in symmetric cipher algorithms and the realization principle of its reconfigurable architectures, this paper describes the reconfigurable composite field multiplication over GF((2^8)k) (k=1,2,3,4) in RISC (reduced instruction set computer) processor and VLIW (very long instruction word) processor architecture, respectively. Through configuration, the architectures can realize the composite field multiplication over GF(2^8), GF ((2^8)2), GF((28)3) and GF((28)4) flexibly and efficiently. We simulated the function of circuits and synthesized the reconfigurable design based on the 0.18 μm CMOS (complementary metal oxide semiconductor) standard cell library and the comparison with other same kind designs. The result shows that the reconfigurable design proposed in the paper can provide higher efficiency under the premise of flexibility.展开更多
Global software pipelining is a complex but efficient compilation technique to exploit instruction-level parallelism for loops with branches. This paper presents a novel global software pipelining technique, called Th...Global software pipelining is a complex but efficient compilation technique to exploit instruction-level parallelism for loops with branches. This paper presents a novel global software pipelining technique, called Thace Software Pipelining,targeted to the instruction-level parallel processors such as Very Long Instruc-tion Word (VLIW) and superscalar machines. Thace software pipelining applies a global code scheduling technique to compact the original loop body. The re-sulting loop is called a trace software pipelined (TSP) code. The trace softwrae pipelined code can be directly executed with special architectural support or call be transformed into a globally software pipelined loop for the current VLIW and superscalar processors. Thus, exploiting parallelism across all iterations of a loop can be completed through compacting the original loop body with any global code scheduling technique. This makes our new technique very promis-ing in practical compilers. Finally, we also present the preliminary experimental results to support our new approach.展开更多
As semi-conductor technologies move down to the nanometer scale, leakage power has become a significant component of the total power consumption. In this paper, we present a leakage-aware modulo scheduling algorithm t...As semi-conductor technologies move down to the nanometer scale, leakage power has become a significant component of the total power consumption. In this paper, we present a leakage-aware modulo scheduling algorithm to achieve leakage energy saving for applications with loops on Very Long Instruction Word (VLIW) architectures. The proposed algorithm is designed to maximize the idleness of function units integrated with the dual-threshold domino logic, and reduce the number of transitions between the active and sleep modes. We have implemented our technique in the Trimaran compiler and conducted experiments using a set of embedded benchmarks from DSPstone and Mibench on the cycle-accurate VLIW simulator of Trimaran. The results show that our technique achieves significant leakage energy saving compared with a previously published DAG-based (Directed Acyclic Graph) leakage-aware scheduling algorithm.展开更多
基金Supported by the Industrial Internet Innovation and Development Project of Ministry of Industry and Information Technology (No.GHBJ2004)。
文摘A Taylor series expansion(TSE) based design for minimum mean-square error(MMSE) and QR decomposition(QRD) of multi-input and multi-output(MIMO) systems is proposed based on application specific instruction set processor(ASIP), which uses TSE algorithm instead of resource-consuming reciprocal and reciprocal square root(RSR) operations.The aim is to give a high performance implementation for MMSE and QRD in one programmable platform simultaneously.Furthermore, instruction set architecture(ISA) and the allocation of data paths in single instruction multiple data-very long instruction word(SIMD-VLIW) architecture are provided, offering more data parallelism and instruction parallelism for different dimension matrices and operation types.Meanwhile, multiple level numerical precision can be achieved with flexible table size and expansion order in TSE ISA.The ASIP has been implemented to a 28 nm CMOS process and frequency reaches 800 MHz.Experimental results show that the proposed design provides perfect numerical precision within the fixed bit-width of the ASIP, higher matrix processing rate better than the requirements of 5G system and more rate-area efficiency comparable with ASIC implementations.
基金the National Natural Science Foundation of China(Grant No.60425413)COMSATS Institute of Information Technology, Pakistan
文摘Aims to provide the block architecture of CoStar3400 DSP that is a high performance, low power and scalable VLIW DSP core, it efficiently deployed a variable-length execution set (VLES) execution model which utilizes the maximum parallelism by allowing multiple address generations and data arithmetic logic units to execute multiple instructions in a single clock cycle. The scalability was provided mainly in using more or less number of functional units according to the intended application. Low power support was added by careful architectural design techniques such as fine-grain clock gating and activation of only the required number of control signals at each stage of the pipeline. The said features of the core make it a suitable candidate for many SoC configurations, especially for compute intensive applications such as wire-line and wireless communications, including infrastructure and subscriber communications. The embedded system designers can efficiently use the scalability and VLIW features of the core by scaling the number of execution units according to specific needs of the application to effectively reduce the power consumption, chip area and time to market the intended final product.
基金supported by the National Natural Science Foundation of China(6110015561227004+4 种基金613720716137213161201289)the Fundamental Research Funds of the Central Universities of China(K5051302096JB140207)
文摘Matrix inversion is a critical part in communication, signal processing and electromagnetic system. A flexible and scalable very long instruction word (VLIW) processor with clustered architecture is proposed for matrix inversion. A global register file (RF) is used to connect al the clusters. Two nearby clusters share a local register file. The instruction sets are also designed for the VLIW processor. Experimental results show that the proposed VLIW architecture takes only 45 latency to invert a 4 × 4 matrix when running at 150 MHz. The proposed design is roughly five times faster than the DSP solution in processing speed.
基金Supported by the National Natural Science Foundation of China (No.60236020)the Specialized Research Fund for the Doctoral Program of Higher Education of MOE,China (No.20050003083)
文摘The cost of the central register file and the size of the program code limit the scalability of very long instruction word(VLIW) processors with increasing numbers of functional units.This paper presents the architectural design of a six-way VLIW digital signal processor(DSP) with clustered register files.The architecture uses a variable length instruction set and supports dynamic instruction dispatching.The one-level memory system architecture of the processor includes 16-KB instruction and data caches and 16-KB instruction and data on-chip RAM.A compiler based on the Open64 was developed for the system.Evaluations show that the processor is suitable for high performance applications with a high code density and small program code size.
基金supported by the National Natural Science Foundation of China(Nos.61300010 and 61300011)
文摘Very Long Instruction Word (VLIW) architectures are commonly used in application-specific domains due to their parallelism and low-power characteristics. Recently, parameterization of such architectures allows for runtime adaptation of the issue-width to match the inherent Instruction Level Parallelism (ILP) of an application. One implementation of such an approach is that the event of the issue-width switching dynamically triggers the reconfiguration of the data cache at runtime. In this paper, the relationship between cache resizing and issue-width is well investigated. We have observed that the requirement of the cache does not always correlate with the issue- width of the VLIW processor. To further coordinate the cache resizing with the changing issue-width, we present a novel feedback mechanism to "block" the low yields of cache resizing when the issue-width changes. In this manner, our feedback cache mechanism has a coordinated effort with the issue-width changes, which leads to a noticeable improvement of the cache performance. The experiments show that there is 10% energy savings as well as a 2.3% cache misses decline on average achieved, compared with the cache without the feedback mechanism. Therefore, the feedback mechanism is proven to have the capability to ensure more benefits are achieved from the dynamic and frequent reconfiguration.
基金Supported by the National Natural Science Foundation of China(61202492,61309022,61309008)the Natural Science Foundation for Young of Shaanxi Province(2013JQ8013)
文摘The composite field multiplication is an important and complex module in symmetric cipher algorithms, and its realization performance directly restricts the processing speed of symmetric cipher algorithms. Based on the characteristics of composite field multiplication in symmetric cipher algorithms and the realization principle of its reconfigurable architectures, this paper describes the reconfigurable composite field multiplication over GF((2^8)k) (k=1,2,3,4) in RISC (reduced instruction set computer) processor and VLIW (very long instruction word) processor architecture, respectively. Through configuration, the architectures can realize the composite field multiplication over GF(2^8), GF ((2^8)2), GF((28)3) and GF((28)4) flexibly and efficiently. We simulated the function of circuits and synthesized the reconfigurable design based on the 0.18 μm CMOS (complementary metal oxide semiconductor) standard cell library and the comparison with other same kind designs. The result shows that the reconfigurable design proposed in the paper can provide higher efficiency under the premise of flexibility.
文摘Global software pipelining is a complex but efficient compilation technique to exploit instruction-level parallelism for loops with branches. This paper presents a novel global software pipelining technique, called Thace Software Pipelining,targeted to the instruction-level parallel processors such as Very Long Instruc-tion Word (VLIW) and superscalar machines. Thace software pipelining applies a global code scheduling technique to compact the original loop body. The re-sulting loop is called a trace software pipelined (TSP) code. The trace softwrae pipelined code can be directly executed with special architectural support or call be transformed into a globally software pipelined loop for the current VLIW and superscalar processors. Thus, exploiting parallelism across all iterations of a loop can be completed through compacting the original loop body with any global code scheduling technique. This makes our new technique very promis-ing in practical compilers. Finally, we also present the preliminary experimental results to support our new approach.
基金supported by the National Natural Science Foundation of China under Grant Nos. 60873006 and 61070049the International Collaborative Research Program under Grant No. 2010DFB10930+1 种基金the Beijing Science Foundation under Grant No.KZ200910028007the Australian Research Council (ARC) under Grant No. DP0881330
文摘As semi-conductor technologies move down to the nanometer scale, leakage power has become a significant component of the total power consumption. In this paper, we present a leakage-aware modulo scheduling algorithm to achieve leakage energy saving for applications with loops on Very Long Instruction Word (VLIW) architectures. The proposed algorithm is designed to maximize the idleness of function units integrated with the dual-threshold domino logic, and reduce the number of transitions between the active and sleep modes. We have implemented our technique in the Trimaran compiler and conducted experiments using a set of embedded benchmarks from DSPstone and Mibench on the cycle-accurate VLIW simulator of Trimaran. The results show that our technique achieves significant leakage energy saving compared with a previously published DAG-based (Directed Acyclic Graph) leakage-aware scheduling algorithm.