Abstract: The high performance of state-of-the-art deep neural networks (DNNs) comes at the cost of substantial computing resources. Network quantization has recently been recognized as a promising way to reduce this resource usage significantly. However, previous quantization work has mostly focused on DNN inference, and few works address the challenges of DNN training. In this paper, we leverage a dynamic fixed-point (DFP) quantization algorithm and a stochastic rounding (SR) strategy to develop fully quantized 8-bit neural networks for low-bitwidth training. Experiments show that, compared with the full-precision networks, the accuracy drop of our quantized convolutional neural networks (CNNs) can be kept below 2%, even for deep models evaluated on the ImageNet dataset. Additionally, our 8-bit GNMT translation network achieves a BLEU score nearly identical to that of the full-precision network. We further implement a prototype on an FPGA, and the synthesis results show that the low-bitwidth training scheme reduces resource usage significantly.
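To make the two ingredients concrete, here is a minimal NumPy sketch of dynamic fixed-point quantization with stochastic rounding. The per-tensor shared-exponent rule and all names here are illustrative assumptions; the paper's exact scheme may differ.

```python
import numpy as np

def dfp_quantize(x, bits=8, rng=None):
    """Dynamic fixed-point quantization with stochastic rounding (sketch).

    A shared exponent is chosen per tensor so the largest magnitude just
    fits the signed `bits`-bit range; this convention is assumed, not
    necessarily the paper's exact exponent-update rule.
    """
    rng = rng or np.random.default_rng()
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(x))) + 1e-12
    e = int(np.ceil(np.log2(max_abs / qmax)))     # shared exponent
    scaled = x / 2.0 ** e
    # Stochastic rounding: round up with probability equal to the
    # fractional part, so the rounding error is zero in expectation.
    floor = np.floor(scaled)
    q = floor + (rng.random(x.shape) < scaled - floor)
    return np.clip(q, -qmax - 1, qmax).astype(np.int8), e

w = np.random.randn(4, 4).astype(np.float32)
q, e = dfp_quantize(w)
print(np.max(np.abs(w - q.astype(np.float32) * 2.0 ** e)))  # small error
```

The unbiasedness of stochastic rounding is what matters for training: deterministic rounding would systematically discard small gradient updates, while SR preserves them in expectation.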
Funding: The Academic Colleges and Universities Innovation Program 2.0 (No. BP0719013).
Abstract: To address the hardware deployment problem caused by the heavy computational complexity of convolutional layers and the limited resources available for hardware network inference, a look-up table (LUT)-based convolution architecture is built on a field-programmable gate array (FPGA) using integer multipliers and addition trees. With the help of the Winograd algorithm, convolution and multiplication are optimized to reduce computational complexity. The LUT-based operator is further optimized to construct a processing element (PE). Simultaneously, optimized storage streams improve memory-access efficiency and relieve bandwidth constraints, and the data toggle rate is reduced to lower power consumption. The experimental results show that building basic processing units with the Winograd algorithm significantly reduces the number of multipliers and accelerates hardware deployment, while time-division multiplexing of the processing units improves resource utilization. Under these experimental conditions, compared with the traditional convolution method, the architecture reduces computing resources by 2.25 times and improves peak throughput by 19.3 times. The LUT-based Winograd accelerator can thus effectively solve the deployment problem posed by limited hardware resources.
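To see where the multiplier savings come from, the following Python sketch implements the 1-D Winograd transform F(2,3), which produces two convolution outputs with four multiplications instead of six; 2-D accelerator tiles nest this same transform. This is the textbook transform only and reflects none of the paper's fixed-point or LUT details.

```python
import numpy as np

# Winograd F(2,3): two outputs of a 3-tap convolution from 4 multiplies.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 convolution outputs."""
    U = G @ g             # transformed filter (precomputable per layer)
    V = BT @ d            # transformed input tile
    return AT @ (U * V)   # 4 elementwise multiplies + output transform

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 0.0, -1.0])
assert np.allclose(winograd_f23(d, g), np.convolve(d, g[::-1], "valid"))
```

Because the filter transform U can be precomputed, only the four elementwise products lie on the critical path, which is the multiplier reduction the abstract refers to.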
Funding: Supported in part by the Major Program of the Ministry of Science and Technology of China under Grant 2019YFB2205102, and in part by the National Natural Science Foundation of China under Grants 61974164, 62074166, 61804181, 62004219, 62004220, and 62104256.
Abstract: With the continuous development of deep learning, deep convolutional neural networks (DCNNs) have attracted wide industrial attention due to their high accuracy in image classification. Compared with other DCNN hardware deployment platforms, the field-programmable gate array (FPGA) has the advantages of programmability, low power consumption, parallelism, and low cost. However, the enormous computational load of DCNNs and the limited logic capacity of FPGAs restrict the energy efficiency of DCNN accelerators. The traditional sequential sliding-window method can improve accelerator throughput through data reuse, but its reuse rate is low because data between rows is read repeatedly. This paper proposes a fast data-readout strategy based on a circular sliding-window reading method, which improves inter-row data reuse by optimizing the memory-access order of the input data. In addition, the multiplication bit width of the DCNN accelerator is much smaller than that of the digital signal processing (DSP) blocks on the FPGA, so dedicating a single DSP to one multiplication wastes resources. A multiplier-sharing strategy is therefore proposed: the accelerator's multipliers are customized so that a single DSP block can complete multiple groups of 4-, 6-, and 8-bit signed multiplications in parallel. Finally, based on these two strategies, an optimized FPGA accelerator is proposed. The accelerator is written in Verilog and deployed on a Xilinx VCU118. When recognizing the CIFAR-10 dataset, its energy efficiency is 39.98 GOPS/W, a 1.73× improvement over previous DCNN FPGA accelerators; on the ImageNet dataset, its energy efficiency is 41.12 GOPS/W, a 1.28×–3.14× improvement over other designs.
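The multiplier-sharing idea can be demonstrated in ordinary integer arithmetic: packing two small operands into one wide word with a guard band lets a single wide multiplier (for example, the 27×18 multiplier in a DSP48 slice) return two independent products in separate output fields. The sketch below uses unsigned 8-bit operands for clarity; the signed 4-, 6-, and 8-bit packing in the abstract additionally requires sign-correction logic not shown here.

```python
def packed_dual_mul(a, b, x, w=8, guard=2):
    """Compute a*x and b*x with one wide multiplication (sketch).

    a, b, x are unsigned w-bit integers; a is packed above b with a
    guard band so the two partial products cannot overlap.
    """
    shift = 2 * w + guard              # width of the low product field
    packed = (a << shift) | b          # one wide operand
    prod = packed * x                  # the single shared multiply
    lo = prod & ((1 << shift) - 1)     # b*x lands in the low field
    hi = prod >> shift                 # a*x lands in the high field
    return hi, lo

assert packed_dual_mul(200, 37, 251) == (200 * 251, 37 * 251)
```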
Abstract: Quantized neural networks (QNNs), which use low-bitwidth numbers to represent parameters and perform computations, have been proposed to reduce computational complexity, storage size, and memory usage. In QNNs, parameters and activations are uniformly quantized so that multiplications and additions can be accelerated by bitwise operations. However, the distributions of parameters in neural networks are often imbalanced, so uniform quantization determined from extremal values may underutilize the available bitwidth. In this paper, we propose a novel quantization method that ensures balanced distributions of quantized values. Our method first recursively partitions the parameters by percentiles into balanced bins and then applies uniform quantization. We also introduce computationally cheaper approximations of percentiles to reduce the overhead this partitioning introduces. Overall, our method improves the prediction accuracy of QNNs without introducing extra computation during inference, has negligible impact on training speed, and is applicable to both convolutional and recurrent neural networks. Experiments on standard datasets including ImageNet and Penn Treebank confirm its effectiveness. On ImageNet, the top-5 error rate of our 4-bit quantized GoogLeNet model is 12.7%, which is superior to state-of-the-art QNNs.
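The heart of the method, equally populated bins chosen by percentiles before uniform code values are assigned, fits in a few lines of NumPy. This sketch computes exact percentiles and places code values uniformly in [-1, 1]; the paper's recursive partitioning and its cheaper percentile approximations are not reproduced here.

```python
import numpy as np

def balanced_quantize(w, bits=2):
    """Quantize so that each of the 2**bits levels is used equally often."""
    n_bins = 2 ** bits
    # Bin edges at evenly spaced percentiles -> balanced occupancy even
    # for heavily skewed weight distributions.
    edges = np.percentile(w, np.linspace(0, 100, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, w, side="right") - 1, 0, n_bins - 1)
    codes = np.linspace(-1.0, 1.0, n_bins)   # uniformly spaced levels
    return codes[idx], idx

w = np.random.randn(10000) ** 3              # deliberately imbalanced weights
q, idx = balanced_quantize(w, bits=2)
print(np.bincount(idx))                      # roughly 2500 weights per bin
```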
Abstract: This paper presents a new VQ+DPCM+DCT algorithm for image coding based on the self-organizing feature map (SOFM) algorithm. In addition, a frequency-sensitive SOFM (FSOFM) has been developed. Simulation results show that very good visual quality of the coded image is obtained at 0.252 bits/pixel.
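As a rough illustration of the frequency-sensitive idea, the sketch below trains a VQ codebook in which each codeword's distance is scaled by its win count, so frequently winning codewords are penalized and the codebook stays balanced. This is generic frequency-sensitive competitive learning under assumed parameters, not necessarily the paper's exact FSOFM update or neighborhood function.

```python
import numpy as np

def fs_codebook(blocks, n_codes=64, epochs=3, lr=0.1, seed=0):
    """Train a frequency-sensitive VQ codebook (sketch)."""
    rng = np.random.default_rng(seed)
    codes = blocks[rng.choice(len(blocks), n_codes, replace=False)].copy()
    wins = np.ones(n_codes)
    for _ in range(epochs):
        for x in blocks:
            # Fairness penalty: distance scaled by how often a code won.
            k = np.argmin(np.sum((codes - x) ** 2, axis=1) * wins)
            codes[k] += lr * (x - codes[k])   # move the winner toward x
            wins[k] += 1
    return codes

blocks = np.random.rand(2000, 16)     # e.g. flattened 4x4 image blocks
codebook = fs_codebook(blocks, n_codes=32)
```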
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 41971424 and 61701191), the Key Technical Project of Xiamen Ocean Bureau (Grant No. 18CZB033HJ11), the Natural Science Foundation of Fujian Province (Grant Nos. 2019J01712 and 2020J01701), the Key Technical Project of Xiamen Science and Technology Bureau (Grant Nos. 3502Z20191018, 3502Z20201007, 3502Z20191022, and 3502Z20203057), and the Science and Technology Project of the Education Department of Fujian Province (Grant Nos. JAT190321, JAT190318, and JAT190315).
Abstract: Convolutional neural networks (CNNs) have achieved great success in many computer vision tasks. However, it is difficult to deploy CNN models on low-cost devices with limited power budgets, because most existing CNN models are computationally expensive. CNN model compression and acceleration have therefore become a hot research topic in deep learning. Typical schemes for speeding up the feed-forward process with a slight accuracy loss include parameter pruning and sharing, low-rank factorization, compact convolutional filters, and knowledge distillation. In this study, we propose a general acceleration scheme that replaces floating-point multiplication with integer addition. The motivation is that every floating-point number can be represented as the summation of an exponential series, so the multiplication of two floating-point numbers can be converted into additions among exponents. In the experiments, we directly apply the proposed scheme to AlexNet, VGG, and ResNet for image classification, and to Faster R-CNN for object detection. The results on ImageNet and PASCAL VOC show that the proposed quantization scheme performs well even with only one item of the exponential series. Moreover, we analyze the efficiency of our method on mainstream FPGAs; the experimental results show that the proposed scheme achieves acceleration on FPGA with only a slight accuracy loss.
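A toy sketch of the conversion follows: one factor is greedily decomposed into a signed sum of powers of two, after which multiplying by it reduces to exponent additions (emulated here with math.ldexp). The greedy nearest-power decomposition is an assumption for illustration; with n_terms=1 it corresponds to the "only one item" setting mentioned above.

```python
import math

def pow2_terms(a, n_terms=1):
    """Greedily decompose a into a signed sum of powers of two."""
    terms, r = [], a
    for _ in range(n_terms):
        if r == 0:
            break
        e = math.floor(math.log2(abs(r)))
        if abs(r) / 2.0 ** e > 1.5:       # snap to the nearest power of two
            e += 1
        s = 1 if r > 0 else -1
        terms.append((s, e))
        r -= s * 2.0 ** e                 # residual for the next term
    return terms

def approx_mul(a, b, n_terms=1):
    """a*b via exponent additions: b * 2**e only shifts b's exponent."""
    return sum(s * math.ldexp(b, e) for s, e in pow2_terms(a, n_terms))

print(approx_mul(3.7, 1.25, n_terms=1), 3.7 * 1.25)  # 5.0      vs 4.625
print(approx_mul(3.7, 1.25, n_terms=3), 3.7 * 1.25)  # 4.609375 vs 4.625
```

In hardware, adding e to a floating-point exponent (or shifting an integer) is far cheaper than a mantissa multiplier, which is the source of the FPGA savings reported above.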
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 61972167 and 61802135), the Project of Guangxi Science and Technology (Grant No. GuiKeAD21075030), the Guangxi "Bagui Scholar" Teams for Innovation and Research Project, the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing, the Guangxi Talent Highland Project of Big Data Intelligence and Application, and the Open Project Program of the National Laboratory of Pattern Recognition (NLPR) (Grant No. 202000012).
Abstract: Existing dish-recognition algorithms mainly focus on accuracy over predefined classes, which limits their application scope. In this paper, we propose a practical two-stage dish-recognition framework (DRNet) that trades off speed and accuracy while adapting to variation in the number of classes. In the first stage, we build an arbitrary-oriented dish detector (AODD) to localize dish positions, which effectively alleviates the impact of background noise and pose variations. In the second stage, we propose a dish re-identifier (DReID) that recognizes registered dishes, handling uncertain categories. To further improve the accuracy of DRNet, we design an attribute recognition (AR) module that predicts dish attributes; the attributes serve as auxiliary information to enhance the discriminative ability of DRNet. Moreover, the model is pruned and quantized for deployment in embedded environments. Finally, to facilitate the study of dish recognition, a well-annotated dataset is established. AODD, DReID, AR, and the full DRNet run at about 14, 25, 16, and 5 fps, respectively, on RKNN 3399 Pro hardware.
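The "registered dishes" wording suggests open-set matching against a gallery of embeddings rather than a fixed classifier head, which is what lets the class count vary without retraining. The hypothetical sketch below shows such a re-identifier; the names and the cosine-similarity rule are assumptions, not the paper's implementation.

```python
import numpy as np

def register(gallery, name, emb):
    """Add a dish embedding to the registry; new classes need no retraining."""
    gallery.setdefault(name, []).append(emb / np.linalg.norm(emb))

def reidentify(gallery, emb):
    """Return the registered dish whose gallery embedding is most similar."""
    emb = emb / np.linalg.norm(emb)
    score = lambda entries: max(float(emb @ g) for g in entries)
    return max(gallery, key=lambda name: score(gallery[name]))

gallery = {}
register(gallery, "dumplings", np.array([0.9, 0.1, 0.2]))
register(gallery, "noodles", np.array([0.1, 0.8, 0.3]))
print(reidentify(gallery, np.array([0.85, 0.2, 0.15])))  # -> dumplings
```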