Funding: Supported by the National Natural Science Foundation of China (No. 61802304, 61834005, 61772417, 61602377) and the Shaanxi Province Key R&D Plan (No. 2021GY-029).
Abstract: Deep learning algorithms have been widely used in computer vision, natural language processing and other fields. However, due to the ever-increasing scale of deep learning models, the requirements on storage and computing performance keep rising, and processors based on the von Neumann architecture have gradually exposed significant shortcomings such as high power consumption and long latency. To alleviate this problem, large-scale processing systems are shifting from a traditional computing-centric model to a data-centric model. A near-memory computing array architecture based on a shared buffer is proposed in this paper to improve system performance; it supports instructions that integrate storage and computation, reducing data movement between the processor and main memory. Through data reuse, the processing speed of the algorithm is further improved. The proposed architecture is verified and tested through a parallel realization of a convolutional neural network (CNN) algorithm. The experimental results show that, at a frequency of 110 MHz, the speed of a single convolution operation is increased by 66.64% on average compared with a CNN architecture that performs parallel calculations on a field programmable gate array (FPGA). The processing speed of the whole convolution layer is improved by 8.81% compared with a reconfigurable array processor that does not support near-memory computing.
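The abstract above summarizes the architecture rather than specifying it; as a minimal sketch of the data-reuse idea behind the shared-buffer design, the Python code below models a three-row buffer that is filled from "main memory" once per input row and then reused across all overlapping 3x3 convolution windows, so the number of main-memory reads drops compared with fetching every window independently. The buffer model, the function name conv3x3_with_row_reuse, and the read-counting scheme are illustrative assumptions, not the paper's instruction set or hardware design.

# Illustrative sketch only: models how a shared row buffer reduces
# main-memory traffic for a 3x3 convolution through data reuse.
import numpy as np

def conv3x3_with_row_reuse(image, kernel):
    """Valid 3x3 convolution; counts main-memory reads under a simple
    three-row shared-buffer model (an assumption, not the paper's design)."""
    H, W = image.shape
    out = np.zeros((H - 2, W - 2))
    main_memory_reads = 0
    # The "shared buffer" holds three input rows at a time.
    row_buffer = [image[0], image[1], image[2]]
    main_memory_reads += 3 * W
    for i in range(H - 2):
        if i > 0:  # slide the buffer down: only one new row is fetched from memory
            row_buffer = [row_buffer[1], row_buffer[2], image[i + 2]]
            main_memory_reads += W
        window_rows = np.stack(row_buffer)  # all windows in this row reuse the buffer
        for j in range(W - 2):
            out[i, j] = np.sum(window_rows[:, j:j + 3] * kernel)
    # Without reuse, every output pixel would fetch a full 3x3 window:
    reads_without_reuse = 9 * (H - 2) * (W - 2)
    return out, main_memory_reads, reads_without_reuse

img = np.random.rand(8, 8)
k = np.ones((3, 3)) / 9.0
_, with_reuse, without_reuse = conv3x3_with_row_reuse(img, k)
print(with_reuse, without_reuse)  # 64 vs 324 main-memory reads for an 8x8 input

The gap between the two counts is the traffic that near-memory computing and data reuse aim to remove; the paper's reported speedups come from its actual array architecture, which this toy model does not reproduce.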
Funding: Supported by the National Natural Science Foundation of China (61834005, 61772417), the National Science and Technology Major Project (2020AAA0104603), and the Key Research and Development Program of Shaanxi (2021GY-029 and 2022GY-027).
Abstract: As a kind of generative adversarial network (GAN), Cycle-GAN shows an apparent superiority in image style translation. However, its complicated architecture, with a large number of parameters and high computational complexity, poses a big challenge for deployment on resource-constrained platforms. To make full use of the parallelism of the hardware while guaranteeing image quality, this paper improves the generator network with a hardware-friendly Inception module. The optimized framework is named simplified Cycle-GAN (S-CycleGAN); it greatly reduces the convolution parameters while avoiding the degradation of image quality caused by structural compression. Tested on the apple2orange and horse2zebra datasets, the experimental results show that the images generated by S-CycleGAN outperform the baseline and other models. The number of parameters is reduced by 19.54%, memory usage is cut down by 9.11%, the theoretical amount of multiply-adds (MAdds) decreases by 17.96%, and the floating-point operations (FLOPs) diminish by 18.91%. Finally, S-CycleGAN was mapped onto the dynamic programmable reconfigurable array processor (DPRAP), which calculates convolution and deconvolution in a unified architecture and supports flexible runtime switching. The prototype system is implemented on a Xilinx field programmable gate array (FPGA) XC6VLX550T-FF1759. The synthesis results show that, at 150 MHz, the hardware resource consumption is reduced by 52% compared with a recent FPGA scheme.
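The exact S-CycleGAN generator is not reproduced in the abstract above; as a hedged sketch of why an Inception-style module cuts convolution parameters, the PyTorch code below replaces one wide 3x3 convolution with a parallel 1x1 branch and a bottlenecked 3x3 branch and compares parameter counts. The class name InceptionLikeBlock, the channel splits, and the branch layout are illustrative assumptions, not the published S-CycleGAN structure.

# Illustrative sketch only: an Inception-style block with 1x1 bottlenecks
# versus a plain 3x3 convolution, compared by parameter count.
import torch
import torch.nn as nn

class InceptionLikeBlock(nn.Module):
    """Hypothetical hardware-friendly block (not the paper's exact generator)."""
    def __init__(self, in_ch=256, out_ch=256):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch // 2, kernel_size=1)  # cheap 1x1 branch
        self.branch2 = nn.Sequential(                                 # bottlenecked 3x3 branch
            nn.Conv2d(in_ch, in_ch // 4, kernel_size=1),
            nn.Conv2d(in_ch // 4, out_ch // 2, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch2(x)], dim=1)

def param_count(m):
    return sum(p.numel() for p in m.parameters())

plain = nn.Conv2d(256, 256, kernel_size=3, padding=1)  # baseline 3x3 convolution
inception = InceptionLikeBlock(256, 256)
x = torch.randn(1, 256, 32, 32)
assert plain(x).shape == inception(x).shape            # same output shape
print(param_count(plain), param_count(inception))      # roughly 590k vs 123k weights

Fewer weights per block means less on-chip storage and fewer multiply-adds to schedule, which is the property that makes such a module easier to map onto a resource-constrained array like the DPRAP described above; the paper's specific 19.54% parameter reduction refers to its full network, not to this toy comparison.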