Journal Articles
7 articles found
Towards high performance low bitwidth training for deep neural networks
1
Authors: Chunyou Su, Sheng Zhou, Liang Feng, Wei Zhang. Journal of Semiconductors (indexed: EI, CAS, CSCD), 2020, No. 2, pp. 63-72, 10 pages
The high performance of state-of-the-art deep neural networks (DNNs) comes at the cost of huge consumption of computing resources. Network quantization has recently been recognized as a promising way to address this problem and significantly reduce resource usage. However, previous quantization works have mostly focused on DNN inference, and very few have addressed the challenges of DNN training. In this paper, we leverage a dynamic fixed-point (DFP) quantization algorithm and a stochastic rounding (SR) strategy to develop fully quantized 8-bit neural networks targeting low-bitwidth training. The experiments show that, compared with full-precision networks, the accuracy drop of our quantized convolutional neural networks (CNNs) can be less than 2%, even for deep models evaluated on the ImageNet dataset. Additionally, our 8-bit GNMT translation network achieves almost the same BLEU score as the full-precision network. We further implement a prototype on an FPGA, and the synthesis results show that the low-bitwidth training scheme can significantly reduce resource usage.
Keywords: CNN; quantized neural networks; limited precision training
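To make the two ingredients named in the abstract concrete, here is a minimal NumPy sketch of dynamic fixed-point quantization with stochastic rounding. It assumes one shared fraction length per tensor, chosen from the tensor's largest magnitude and recomputed on every call; the function names are hypothetical, and the abstract does not say how often or at what granularity the paper updates the fraction length.

```python
import numpy as np

def dfp_fraction_length(x: np.ndarray, bits: int = 8) -> int:
    """Shared fraction length so that max|x| fits a signed `bits`-bit integer (assumed rule)."""
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        return bits - 1
    int_bits = int(np.ceil(np.log2(max_abs)))  # integer bits needed, sign bit excluded
    return bits - 1 - int_bits

def stochastic_round(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Round up with probability equal to the fractional part, so E[result] == x."""
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))

def dfp_quantize(x, bits=8, rng=None, stochastic=True):
    """Return integer codes and the shared fraction length fl (value = code * 2**-fl)."""
    if rng is None:
        rng = np.random.default_rng()
    fl = dfp_fraction_length(x, bits)
    scaled = x * 2.0 ** fl
    q = stochastic_round(scaled, rng) if stochastic else np.round(scaled)
    q = np.clip(q, -2 ** (bits - 1), 2 ** (bits - 1) - 1)  # saturate to the signed range
    return q.astype(np.int32), fl

def dfp_dequantize(q: np.ndarray, fl: int) -> np.ndarray:
    return q.astype(np.float64) * 2.0 ** (-fl)
```

Stochastic rounding matters mostly for gradients and weight updates during training: small updates that deterministic rounding would always drop to zero still survive on average.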
WinoNet: Reconfigurable look-up table-based Winograd accelerator for arbitrary precision convolutional neural network inference
2
Authors: Wang Chengcheng, Li He, Cao Yanpeng, Song Changjun, Yu Feng, Tang Yongming. Journal of Southeast University (English Edition) (indexed: EI, CAS), 2022, No. 4, pp. 332-339, 8 pages
To address the hardware deployment problem caused by the high computational complexity of convolutional layers and the limited hardware resources available for network inference, a look-up table (LUT)-based convolution architecture is built on a field-programmable gate array (FPGA) using integer multipliers and addition trees. With the help of the Winograd algorithm, the convolution and its multiplications are optimized to reduce computational complexity. The LUT-based operator is further optimized to construct a processing unit (PE). At the same time, optimized storage streams improve memory access efficiency and address bandwidth constraints, and the data toggle rate is reduced to lower power consumption. The experimental results show that building the basic processing units with the Winograd algorithm significantly reduces the number of multipliers and accelerates hardware deployment, while time-division multiplexing of the processing units improves resource utilization. Under these experimental conditions, compared with the traditional convolution method, the architecture optimizes computing resource usage by a factor of 2.25 and improves peak throughput by a factor of 19.3. The LUT-based Winograd accelerator can effectively solve the deployment problem caused by limited hardware resources.
Keywords: quantized neural networks; look-up table (LUT)-based multiplier; Winograd algorithm; arbitrary precision
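The abstract relies on the Winograd algorithm to cut multiplications. As a generic illustration (not the paper's LUT-based PE), the sketch below uses the standard F(2×2, 3×3) transform matrices from Lavin and Gray: one 4×4 input tile and one 3×3 kernel produce a 2×2 output with 16 element-wise multiplications instead of the 36 a direct sliding window needs. The helper names are hypothetical.

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd transform matrices.
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=np.float64)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2x2_3x3(tile: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """2x2 output of a 3x3 'valid' correlation over a 4x4 tile."""
    U = G @ kernel @ G.T     # filter transform: can be precomputed offline
    V = B_T @ tile @ B_T.T   # input transform: additions/subtractions only
    M = U * V                # the only 16 multiplications
    return A_T @ M @ A_T.T   # output transform: additions/subtractions only

def direct_3x3(tile: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Reference sliding-window version: 4 outputs x 9 multiplications = 36."""
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(tile[i:i + 3, j:j + 3] * kernel)
    return out
```

Since the filter transform is done once per kernel, the per-tile cost in hardware is dominated by the 16 multipliers plus adder trees for the input and output transforms.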
FPGA Optimized Accelerator of DCNN with Fast Data Readout and Multiplier Sharing Strategy (cited by 1)
3
Authors: Tuo Ma, Zhiwei Li, Qingjiang Li, Haijun Liu, Zhongjin Zhao, Yinan Wang. Computers, Materials & Continua (indexed: SCIE, EI), 2023, No. 12, pp. 3237-3263, 27 pages
With the continuous development of deep learning, the Deep Convolutional Neural Network (DCNN) has attracted wide attention in industry owing to its high accuracy in image classification. Compared with other DCNN hardware deployment platforms, the Field Programmable Gate Array (FPGA) offers programmability, low power consumption, parallelism, and low cost. However, the enormous amount of computation in DCNNs and the limited logic capacity of FPGAs restrict the energy efficiency of DCNN accelerators. The traditional sequential sliding-window method can improve the throughput of a DCNN accelerator through data multiplexing, but its data multiplexing rate is low because it repeatedly reads the data shared between rows. This paper proposes a fast data readout strategy based on a circular sliding-window data reading method, which improves the multiplexing rate of data between rows by optimizing the memory access order of the input data. In addition, the multiplication bit width of the DCNN accelerator is much smaller than that of the Digital Signal Processing (DSP) blocks on the FPGA, so using a single DSP for one multiplication wastes resources. A multiplier sharing strategy is therefore proposed: the accelerator's multiplier is customized so that a single DSP block can complete multiple groups of 4-, 6-, and 8-bit signed multiplications in parallel. Finally, an FPGA-optimized accelerator is proposed based on these two strategies. The accelerator is written in Verilog and deployed on a Xilinx VCU118. On the CIFAR-10 dataset, its energy efficiency is 39.98 GOPS/W, a 1.73× improvement over previous DCNN FPGA accelerators. On the ImageNet dataset, its energy efficiency is 41.12 GOPS/W, 1.28× to 3.14× that of other accelerators.
Keywords: FPGA; accelerator; DCNN; fast data readout strategy; multiplier sharing strategy; network quantization; energy efficient
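The multiplier sharing idea can be illustrated with the common operand-packing trick, sketched here under assumptions: several signed activations that share one weight are packed into one wide operand, multiplied once, and the individual products are recovered by sign-extending each bit field. Python's arbitrary-precision `int` stands in for the DSP multiplier, and `port_bits=27` assumes a 27-bit input port such as the one on Xilinx DSP48E2 slices. The function is hypothetical; the abstract does not describe the paper's actual DSP layout.

```python
def packed_multiply(acts, w, bits=8, port_bits=27):
    """Compute [a * w for a in acts] with ONE wide multiplication.

    acts and w are signed `bits`-bit integers; each lane gets a 2*bits-wide
    field, which is enough for any signed bits x bits product.
    """
    shift = 2 * bits
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    assert all(lo <= a <= hi for a in acts) and lo <= w <= hi
    # Pack: acts[0] in the lowest field, acts[1] shifted by `shift`, and so on.
    packed = sum(a << (i * shift) for i, a in enumerate(acts))
    # Signed width of the packed operand must fit the (assumed) multiplier port.
    assert packed.bit_length() + 1 <= port_bits, "packed operand too wide"
    p = packed * w  # the single shared multiplication
    results = []
    for _ in acts:
        field = p & ((1 << shift) - 1)   # lowest 2*bits bits
        if field >= 1 << (shift - 1):    # sign-extend the field
            field -= 1 << shift
        results.append(field)
        p = (p - field) >> shift         # exact: remove this lane, move to the next
    return results
```

Narrower operands leave room for more lanes: with 4-bit values, three lanes fit in roughly 21 bits in this layout, whereas two 8-bit lanes need about 25. A real DSP also bounds the product and accumulator width, which this sketch does not model.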
Balanced Quantization: An Effective and Efficient Approach to Quantized Neural Networks (cited by 4)
4
Authors: Shu-Chang Zhou, Yu-Zhi Wang, He Wen, Qin-Yao He, Yu-Heng Zou. Journal of Computer Science & Technology (indexed: SCIE, EI, CSCD), 2017, No. 4, pp. 667-682, 16 pages
Quantized neural networks (QNNs), which use low-bitwidth numbers to represent parameters and perform computations, have been proposed to reduce computational complexity, storage size, and memory usage. In QNNs, parameters and activations are uniformly quantized, so that multiplications and additions can be accelerated by bitwise operations. However, the distributions of parameters in neural networks are often imbalanced, so uniform quantization determined from extremal values may underutilize the available bitwidth. In this paper, we propose a novel quantization method that ensures balanced distributions of quantized values. Our method first recursively partitions the parameters by percentiles into balanced bins and then applies uniform quantization. We also introduce computationally cheaper approximations of percentiles to reduce the resulting computational overhead. Overall, our method improves the prediction accuracy of QNNs without introducing extra computation during inference, has a negligible impact on training speed, and is applicable to both convolutional and recurrent neural networks. Experiments on standard datasets, including ImageNet and Penn Treebank, confirm the effectiveness of our method. On ImageNet, the top-5 error rate of our 4-bit quantized GoogLeNet model is 12.7%, which is superior to state-of-the-art QNNs.
Keywords: quantized neural network; percentile; histogram equalization; uniform quantization
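The contrast the abstract draws between extremal-value uniform quantization and percentile-balanced bins can be seen in a small NumPy sketch. Recursive median splitting into 2^b bins lands (up to ties and interpolation) on the k/2^b quantiles, so this sketch computes those edges directly with `np.quantile` and does not reproduce the paper's recursive procedure or its cheaper percentile approximations. Mapping bin codes onto an evenly spaced grid between the tensor's min and max is an assumption made here so that the output stays on a uniform grid. Function names are hypothetical.

```python
import numpy as np

def uniform_quantize(w: np.ndarray, bits: int = 2):
    """Baseline: evenly spaced levels between min and max (extremal values)."""
    n = 2 ** bits
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (n - 1) if hi > lo else 1.0
    codes = np.round((w - lo) / scale).astype(np.int64)
    return codes, lo + codes * scale

def balanced_quantize(w: np.ndarray, bits: int = 2):
    """Percentile-balanced codes: each of the 2**bits codes gets ~equal counts."""
    n = 2 ** bits
    edges = np.quantile(w, np.arange(1, n) / n)        # n-1 interior percentile edges
    codes = np.searchsorted(edges, w, side="right")    # bin index 0..n-1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (n - 1) if hi > lo else 1.0
    return codes, lo + codes * scale                   # assumed uniform output grid
```

For a heavy-tailed tensor (for example, Laplace-distributed weights), `uniform_quantize` puts almost all values into the two middle codes, while `balanced_quantize` uses all four 2-bit codes equally, which is the underutilization problem the abstract describes.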
A New Image Coding Algorithm Based on Self-Organizing Neural Network (cited by 1)
5
Authors: Li Hongsong, Quan Ziyi. The Journal of China Universities of Posts and Telecommunications (indexed: EI, CSCD), 1995, No. 1, pp. 40-43, 4 pages
The paper presents a new VQ+DPCM+DCT image coding algorithm based on the Self-Organizing Feature Map (SOFM) algorithm. In addition, a Frequency-Sensitive SOFM (FSOFM) has also been developed. Simulation results show that very good visual quality of the coded image is obtained at 0.252 bits/pixel.
Keywords: image coding; vector quantization (VQ); self-organizing neural network
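As a rough illustration of VQ codebook training with a frequency-sensitive, SOFM-style rule, here is a NumPy sketch. It is not the paper's FSOFM: it assumes the winner is chosen by win count × squared distance (frequency-sensitive competitive learning), neighbours on a 1-D map index share the update, and the neighbourhood radius shrinks each epoch. All names and hyperparameters are hypothetical, and the DPCM and DCT stages are omitted.

```python
import math
import numpy as np

def train_fs_codebook(blocks, size=16, epochs=5, lr=0.1, radius=2, seed=0):
    """blocks: (N, D) array of flattened image blocks; returns a (size, D) codebook."""
    rng = np.random.default_rng(seed)
    codebook = blocks[rng.choice(len(blocks), size, replace=False)].astype(np.float64)
    wins = np.ones(size)  # how often each codeword has won so far
    for epoch in range(epochs):
        r = max(0, radius - epoch)  # shrinking neighbourhood on a 1-D map
        for x in blocks[rng.permutation(len(blocks))]:
            dist = np.sum((codebook - x) ** 2, axis=1)
            j = int(np.argmin(wins * dist))  # frequently winning codewords are penalised
            wins[j] += 1
            lo, hi = max(0, j - r), min(size, j + r + 1)
            codebook[lo:hi] += lr * (x - codebook[lo:hi])  # move winner and neighbours
    return codebook

def encode(blocks, codebook):
    """Index of the nearest codeword for every block."""
    d = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def bits_per_pixel(codebook_size, block_pixels):
    """Raw VQ rate: one index of log2(size) bits per block."""
    return math.log2(codebook_size) / block_pixels
```

The frequency term keeps rarely used codewords from dying out, so the whole codebook contributes. For scale, a 16-entry codebook over 4×4 blocks costs 4/16 = 0.25 bits/pixel before any further coding.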
Convolution without multiplication: A general speed up strategy for CNNs (cited by 7)
6
Authors: CAI GuoRong, YANG ShengMing, DU Jing, WANG ZongYue, HUANG Bin, GUAN Yin, SU SongJian, SU JinHe, SU SongZhi. Science China (Technological Sciences) (indexed: SCIE, EI, CAS, CSCD), 2021, No. 12, pp. 2627-2639, 13 pages
Convolutional Neural Networks (CNNs) have achieved great success in many computer vision tasks. However, it is difficult to deploy CNN models on low-cost devices with limited power budgets, because most existing CNN models are computationally expensive. Therefore, CNN model compression and acceleration have become a hot research topic in deep learning. Typical schemes for speeding up the feed-forward process with a slight accuracy loss include parameter pruning and sharing, low-rank factorization, compact convolutional filters, and knowledge distillation. In this study, we propose a general acceleration scheme that replaces floating-point multiplication with integer addition. The motivation is that every floating-point number can be represented as the sum of an exponential series; therefore, the multiplication of two floating-point numbers can be converted into additions of exponents. In the experiments, we directly apply the proposed scheme to AlexNet, VGG, and ResNet for image classification, and to Faster-RCNN for object detection. The results on ImageNet and PASCAL VOC show that the proposed quantization scheme performs promisingly, even with only one exponential term. Moreover, we analyze the efficiency of our method on mainstream FPGAs. The experimental results show that the proposed quantization scheme can achieve acceleration on FPGAs with a slight accuracy loss.
Keywords: deep learning; convolutional neural network; network quantization; network speed up
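The core identity behind the abstract is that if x ≈ Σ s_k·2^(e_k) and y ≈ Σ t_l·2^(f_l), then x·y ≈ Σ s_k·t_l·2^(e_k + f_l), so each partial product needs only an integer exponent addition and a sign flip. The sketch below assumes a greedy decomposition into signed powers of two. The abstract does not specify the paper's decomposition, and the function names are hypothetical. `math.ldexp(1.0, n)` computes 2^n here and corresponds to a shift in hardware.

```python
import math

def pow2_terms(x: float, n_terms: int = 2):
    """Greedy x ~= sum(sign * 2**e) using at most n_terms signed powers of two."""
    terms = []
    r = x
    for _ in range(n_terms):
        if r == 0:
            break
        e = round(math.log2(abs(r)))  # nearest power of two in the log domain
        sign = 1 if r > 0 else -1
        terms.append((sign, e))
        r -= sign * 2.0 ** e          # keep approximating the residual
    return terms

def mult_by_exponent_addition(x_terms, y_terms) -> float:
    """Product of two decomposed numbers using only exponent additions and sign flips."""
    total = 0.0
    for sx, ex in x_terms:
        for sy, ey in y_terms:
            total += sx * sy * math.ldexp(1.0, ex + ey)  # 2**(ex + ey): a shift in hardware
    return total
```

With one term per operand (the case the abstract reports as already promising), a product collapses to a single exponent addition: 3.0 × 0.75 becomes 2^2 × 2^0 = 4.0, while two terms each recover the exact 2.25.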
DRNet: Towards fast, accurate and practical dish recognition (cited by 1)
7
Authors: CHENG SiYuan, CHU BinFei, ZHONG BiNeng, ZHANG ZiKai, LIU Xin, TANG ZhenJun, LI XianXian. Science China (Technological Sciences) (indexed: SCIE, EI, CAS, CSCD), 2021, No. 12, pp. 2651-2661, 11 pages
Existing dish recognition algorithms mainly focus on accuracy over a predefined set of classes, which limits their application scope. In this paper, we propose a practical two-stage dish recognition framework (DRNet) that balances speed and accuracy while adapting to changes in the number of classes. In the first stage, we build an arbitrary-oriented dish detector (AODD) to localize dishes, which effectively alleviates the impact of background noise and pose variations. In the second stage, we propose a dish re-identifier (DReID) that recognizes registered dishes in order to handle uncertain categories. To further improve the accuracy of DRNet, we design an attribute recognition (AR) module to predict dish attributes, which are used as auxiliary information to enhance the discriminative ability of DRNet. Moreover, we apply pruning and quantization to the model so that it can be deployed in embedded environments. Finally, to facilitate the study of dish recognition, we establish a well-annotated dataset. Our AODD, DReID, AR, and DRNet run at about 14, 25, 16, and 5 fps, respectively, on the RKNN 3399 pro hardware.
Keywords: neural network acceleration; neural network quantization; object detection; reidentification; dish recognition
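The re-identification stage lets the framework handle a changing set of classes because new dishes are registered rather than trained in. Here is a minimal sketch of that general pattern, assuming embeddings come from some feature extractor that is not shown: it uses a gallery of L2-normalised embeddings, cosine-similarity matching, and a rejection threshold for unregistered dishes. The `DishGallery` class and its threshold are hypothetical and are not the paper's DReID.

```python
import numpy as np

class DishGallery:
    """Hypothetical gallery mapping registered dish names to normalised embeddings."""

    def __init__(self, threshold: float = 0.5):
        self.names = []
        self.embs = []
        self.threshold = threshold  # minimum cosine similarity to accept a match

    def register(self, name: str, emb: np.ndarray) -> None:
        """Adding a dish needs no retraining: just store its embedding."""
        self.names.append(name)
        self.embs.append(emb / np.linalg.norm(emb))

    def identify(self, emb: np.ndarray):
        """Return the best-matching registered dish, or None if nothing is close enough."""
        if not self.embs:
            return None
        q = emb / np.linalg.norm(emb)
        sims = np.stack(self.embs) @ q  # cosine similarities to every registered dish
        best = int(np.argmax(sims))
        return self.names[best] if sims[best] >= self.threshold else None
```

The threshold is the main design knob. Raising it rejects more unfamiliar dishes at the cost of missing some registered ones under pose or lighting changes.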