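Abstract: Brain-inspired processors offer an energy-efficiency advantage over deep learning processors. Their on-chip interconnect generally uses a network-on-chip (NoC), which provides high scalability, high throughput, and high generality. A synchronous NoC makes timing closure of the global clock tree difficult, while an asynchronous NoC faces link delay matching problems and lacks electronic design automation (EDA) tool support for implementation and verification. To address these problems, this paper proposes NosralC, an asynchronous NoC architecture for building globally asynchronous locally synchronous (GALS) multi-core brain-inspired processors. NosralC is implemented with asynchronous links and synchronous routers. Experiments show that, compared with a synchronous baseline on four brain-inspired application datasets, NosralC reduces power consumption by 37.5% to 38.9%, reduces average latency by 5.5% to 8.0%, and improves energy efficiency by 36.7% to 47.6%, while adding no more than 6% extra resources and incurring a small performance overhead (throughput reduced by 0.8% to 2.4%). NosralC was verified on a field-programmable gate array (FPGA), demonstrating that the architecture is implementable.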
Funding: Supported by the National Key Basic Research and Development (973) Program of China (No. 2006CB302700) and the National High-Tech Research and Development (863) Program of China (No. 2007AA01Z2B3).
Abstract: This paper describes a circular first-in first-out (FIFO) buffer and its protocols, which achieve very low latency while maintaining high throughput. Unlike existing serial FIFOs based on asynchronous micropipelines, the cells of this FIFO communicate directly with the input and output ports through a common bus. This eliminates data movement from the input port to the output port, which reduces both latency and power consumption. Furthermore, the latency does not increase with the number of FIFO stages. Single-track asynchronous protocols simplify the FIFO controller design: each cell controller needs only three C-gates, which substantially reduces the area. Simulations with the TSMC 0.25 μm CMOS logic process show that the latency of the 4-stage FIFO is less than 581 ps and the throughput is higher than 2.2 GHz.
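The key structural idea in this abstract, that data stay in the cell they were written to instead of rippling from stage to stage, can be illustrated with a small software model. The sketch below is only a behavioral analogy: the class name `CircularFifo`, its methods, and the read/write indices are assumptions made for illustration, and it does not model the single-track handshake protocol, the C-gate controllers, or any circuit timing described in the paper.

```python
class CircularFifo:
    """Behavioral model of a circular FIFO (hypothetical, illustration only).

    Every cell is reachable from both the input and the output side, so an
    item is written once and read once without being shifted between cells.
    """

    def __init__(self, stages=4):
        self.cells = [None] * stages  # fixed storage cells
        self.head = 0                 # index of the next cell to read
        self.tail = 0                 # index of the next cell to write
        self.count = 0                # number of occupied cells

    def put(self, item):
        """Write an item directly into the cell at the write index."""
        if self.count == len(self.cells):
            raise OverflowError("FIFO full")
        self.cells[self.tail] = item
        self.tail = (self.tail + 1) % len(self.cells)  # wrap around
        self.count += 1

    def get(self):
        """Read the oldest item directly from the cell at the read index."""
        if self.count == 0:
            raise IndexError("FIFO empty")
        item = self.cells[self.head]
        self.cells[self.head] = None
        self.head = (self.head + 1) % len(self.cells)  # wrap around
        self.count -= 1
        return item


if __name__ == "__main__":
    fifo = CircularFifo(stages=4)
    for word in ("a", "b", "c", "d"):
        fifo.put(word)
    print(fifo.get())   # oldest item leaves first
    fifo.put("e")       # reuses the cell that was just freed
    print([fifo.get() for _ in range(4)])
```

In this model the cost of `put` and `get` does not depend on the number of stages, which loosely mirrors the abstract's claim that latency does not grow with FIFO depth. In a serial (shift-style) FIFO, by contrast, each item would have to pass through every stage before reaching the output.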
Funding: Supported by the National Natural Science Foundation of China (No. 90607009), the National High-Tech Research and Development (863) Program (No. 2008AA01Z107), and the National Key Basic Research and Development (973) Program of China (No. 2007CB310701).
Abstract: A cost-effective network-on-chip (NoC) design method is presented, based on the globally asynchronous locally synchronous (GALS) interconnect structure. In this method, data are transmitted among routers, network interfaces (NIs), and intellectual property (IP) cores in synchronous mode via a synchronous circuit. Compared with traditional methods of implementing GALS, this method greatly reduces the transmission latency and is compatible with existing very large scale integration (VLSI) design tools. The platform designed with this method supports two kinds of packetizing mechanisms, any topology, several kinds of traffic, and many configurable parameters such as the number of virtual channels, which makes the platform universal. An NoC evaluation methodology is also given, with a case study showing that the platform and the evaluation methodology work well.