This study presents a new method of 4-pipelined high-performance split multiply-accumulator (MAC) architecture, which is capable of supporting multiple precisions developed for media processors. To speed up the design...This study presents a new method of 4-pipelined high-performance split multiply-accumulator (MAC) architecture, which is capable of supporting multiple precisions developed for media processors. To speed up the design further, a novel partial product compression circuit based on interleaved adders and a modified hybrid partial product reduction tree (PPRT) scheme are proposed. The MAC can perform 1-way 32-bit, 4-way 16-bit signed/unsigned multiply or multiply-accumulate operations and 2-way parallel multiply add (PMADD) operations at a high frequency of 1.25 GHz under worst-case conditions and 1.67 GHz under typical-case conditions, respectively. Compared with the MAC in 32-bit microprocessor without interlocked piped stages (MIPS), the proposed design shows a great advantage in speed. Moreover, an improvement of up to 32% in throughput is achieved. The MAC design has been fabricated with Taiwan Semiconductor Manufacturing Company (TSMC) 90-nm CMOS standard cell technology and has passed a functional test.展开更多
An efficient design method for a 24 × 24 bit +48 bit parallel saturating multiply-accumulate (MAC) unit is described. The augend in the MAC is merged as a partial product into Wallace tree array. The optimized...An efficient design method for a 24 × 24 bit +48 bit parallel saturating multiply-accumulate (MAC) unit is described. The augend in the MAC is merged as a partial product into Wallace tree array. The optimized saturation detection logic is proposed. The 679. 2 μm × 132. 5μm area size has been achieved in 0. 18 μm 1.8 V 1P6M CMOS technology by the full-custom circuit layout design. The simulation results show that the design way has significantly less area (about 23.52% reduction) and less delay than those of the common saturating MAC based on standard cell library.展开更多
Cross-correlation (CC) is the most time-consuming in the implementation of image matching algorithms based on the correlation method. Therefore, how to calculate CC fast is crucial to real-time image matching. This ...Cross-correlation (CC) is the most time-consuming in the implementation of image matching algorithms based on the correlation method. Therefore, how to calculate CC fast is crucial to real-time image matching. This work reveals that the single cascading multiply-accumulate (CAMAC) and concurrent multiply-accumulate (COMAC) architectures which have been widely used in the past, actually, do not necessarily bring about a satisfactory time performance for CC. To obtain better time performance and higher resource efficiency, this paper proposes a configurable circuit involving the advantages of CAMAC and COMAC for a large amount of multiply-accumulate (MAC) operations of CC in exhaustive search. The proposed circuit works in an array manner and can better adapt to changing size image matching in real-time processing. Experimental results demonstrate that this novel circuit which involves the two structures can complete vast MAC calculations at a very high speed. Compared with existing related work, it improves the computation density further and is more flexible to use.展开更多
基金Project (No. 60873112) supported by the National Natural Science Foundation of China
文摘This study presents a new method of 4-pipelined high-performance split multiply-accumulator (MAC) architecture, which is capable of supporting multiple precisions developed for media processors. To speed up the design further, a novel partial product compression circuit based on interleaved adders and a modified hybrid partial product reduction tree (PPRT) scheme are proposed. The MAC can perform 1-way 32-bit, 4-way 16-bit signed/unsigned multiply or multiply-accumulate operations and 2-way parallel multiply add (PMADD) operations at a high frequency of 1.25 GHz under worst-case conditions and 1.67 GHz under typical-case conditions, respectively. Compared with the MAC in 32-bit microprocessor without interlocked piped stages (MIPS), the proposed design shows a great advantage in speed. Moreover, an improvement of up to 32% in throughput is achieved. The MAC design has been fabricated with Taiwan Semiconductor Manufacturing Company (TSMC) 90-nm CMOS standard cell technology and has passed a functional test.
基金The National Natural Science Foundation of China(No.90407009),the National High Technology Research and Develop-ment Program of China(863Program) (No.2003AA1Z1340)
文摘An efficient design method for a 24 × 24 bit +48 bit parallel saturating multiply-accumulate (MAC) unit is described. The augend in the MAC is merged as a partial product into Wallace tree array. The optimized saturation detection logic is proposed. The 679. 2 μm × 132. 5μm area size has been achieved in 0. 18 μm 1.8 V 1P6M CMOS technology by the full-custom circuit layout design. The simulation results show that the design way has significantly less area (about 23.52% reduction) and less delay than those of the common saturating MAC based on standard cell library.
文摘Cross-correlation (CC) is the most time-consuming in the implementation of image matching algorithms based on the correlation method. Therefore, how to calculate CC fast is crucial to real-time image matching. This work reveals that the single cascading multiply-accumulate (CAMAC) and concurrent multiply-accumulate (COMAC) architectures which have been widely used in the past, actually, do not necessarily bring about a satisfactory time performance for CC. To obtain better time performance and higher resource efficiency, this paper proposes a configurable circuit involving the advantages of CAMAC and COMAC for a large amount of multiply-accumulate (MAC) operations of CC in exhaustive search. The proposed circuit works in an array manner and can better adapt to changing size image matching in real-time processing. Experimental results demonstrate that this novel circuit which involves the two structures can complete vast MAC calculations at a very high speed. Compared with existing related work, it improves the computation density further and is more flexible to use.