With the increasing demand for computational power in artificial intelligence (AI) algorithms, dedicated accelerators have become a necessity. However, the complexity of hardware architectures, the vast design search space, and the complex tasks of accelerators pose significant challenges. Traditional search methods can become prohibitively slow as the search space continues to expand. A design space exploration (DSE) method based on transfer learning is proposed, which reduces the time spent on repeated training and uses multi-task models for different tasks on the same processor. The proposed method accurately predicts the latency and energy consumption associated with neural network accelerator design parameters, enabling faster identification of optimal outcomes than traditional methods. Compared with other DSE methods based on a multilayer perceptron (MLP), the required training time is also shorter. Comparative experiments demonstrate that the proposed method improves the efficiency of DSE without compromising the accuracy of the results.
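As a rough illustration of the warm-start idea behind transfer-learning DSE (the linear predictor, data, and tasks below are invented for this sketch, not taken from the paper), a cost model fitted on one task can initialize training on a related task and reach a lower error in the same number of steps than training from scratch:

```python
def gd(w, data, lr=0.01, steps=20):
    """Full-batch gradient descent on a toy 2-parameter linear model
    predicting, say, latency from two design parameters."""
    for _ in range(steps):
        g0 = g1 = 0.0
        for (a, b), y in data:
            err = w[0] * a + w[1] * b - y
            g0 += 2 * err * a / len(data)
            g1 += 2 * err * b / len(data)
        w = [w[0] - lr * g0, w[1] - lr * g1]
    return w

def loss(w, data):
    return sum((w[0] * a + w[1] * b - y) ** 2 for (a, b), y in data) / len(data)

# Target task: a slightly shifted version of a source task whose fitted
# weights were roughly [2, 3]; warm-starting from them converges faster.
data = [((1.0, 1.0), 5.3), ((2.0, 1.0), 7.5), ((1.0, 2.0), 8.4), ((2.0, 3.0), 13.7)]
warm = gd([2.0, 3.0], data)   # transferred initialization
cold = gd([0.0, 0.0], data)   # training from scratch
```

With a small, stable learning rate the warm-started model keeps its low initial error, which is the effect the paper exploits to cut repeated training time.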
Objective: To explore the effectiveness of various interventions in accelerating tooth movement, a systematic review and network meta-analysis (NMA) were conducted. Methods: The MEDLINE, EMBASE, Wiley Library, EBSCO, Web of Science, and Cochrane Central Register of Controlled Trials databases were searched to identify relevant studies. ADDIS 1.16.6 and Stata 16.0 software were used for the NMA. Results: A total of 5,542 articles were retrieved. After screening by two independent investigators, forty-seven randomized controlled trials with 1,390 participants were included in this network meta-analysis. A total of 11 interventions were identified: Piezocision (Piezo), Photobiomodulation therapy (PBMT), Platelet-rich plasma (PRP), Electromagnetic field (EF), Low-intensity laser therapy (LLLT), Low-intensity pulsed ultrasound (LIPUS), Low-frequency vibration (LFV), Distraction osteogenesis (DAD), Corticotomy (Corti), Micro-osteoperforations (MOPs), and Traditional orthodontics (OT); these were classified into 3 classes: surgical treatment, non-surgical treatment, and traditional orthodontic treatment. According to the SUCRA probability ranking of intervention effect, when orthodontic treatment lasted 1 month, PBMT (90.6%), Piezo (87.4%), and MOPs (73.6%) were the top three interventions for improving the efficiency of canine tooth movement. When treatment lasted 2 months, Corti (75.7%), Piezo (69.6%), and LFV (58.9%) were the top three. When treatment lasted 3 months, Corti (73.3%), LLLT (68.4%), and LFV (60.8%) were the top three. Conclusion: PBMT and Piezo significantly improve the efficiency of canine tooth movement after 1 month, while Corti and LFV perform better after 2 and 3 months.
With the rapid development of deep learning algorithms, computational complexity and functional diversity are increasing rapidly. However, the gap between high computational density and insufficient memory bandwidth under the traditional von Neumann architecture is getting worse. Analyzing the algorithmic characteristics of convolutional neural networks (CNNs), it is found that the memory-access characteristics of convolution (CONV) and fully connected (FC) operations are very different. Based on this observation, a dual-mode reconfigurable distributed memory architecture for a CNN accelerator is designed. It can be configured in Bank mode or first-in first-out (FIFO) mode to accommodate the access needs of different operations. At the same time, a programmable memory control unit is designed, which effectively controls the dual-mode configurable distributed memory architecture through customized special access instructions and reduces data-access delay. The proposed architecture is verified and tested by the parallel implementation of several CNN algorithms. The experimental results show that the peak bandwidth reaches 13.44 GB·s^(-1) at an operating frequency of 120 MHz, which is 1.40, 1.12, 2.80, and 4.70 times the peak bandwidth of existing works.
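The Bank/FIFO duality can be pictured in software as one storage array whose access discipline is switched by configuration: random addressing for CONV-style data reuse, in-order streaming for FC-style one-pass access. The interface below is a hypothetical illustration, not the paper's hardware design:

```python
class DualModeBuffer:
    """Sketch of a dual-mode memory: the same storage serves either as a
    randomly addressed bank or as a FIFO stream (interface is assumed)."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.mode = "bank"
        self.head = 0   # FIFO read pointer
        self.tail = 0   # FIFO write pointer

    def config(self, mode):
        """Reconfigure the access mode, resetting the stream pointers."""
        assert mode in ("bank", "fifo")
        self.mode = mode
        self.head = self.tail = 0

    def write(self, data, addr=None):
        if self.mode == "bank":
            self.mem[addr] = data                      # random access
        else:
            self.mem[self.tail % len(self.mem)] = data  # append to stream
            self.tail += 1

    def read(self, addr=None):
        if self.mode == "bank":
            return self.mem[addr]
        value = self.mem[self.head % len(self.mem)]     # consume in order
        self.head += 1
        return value
```

A memory control unit would issue the `config` call per layer type, which is the role the paper's custom access instructions play.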
Recently, due to the availability of big data and the rapid growth of computing power, artificial intelligence (AI) has regained tremendous attention and investment. Machine learning (ML) approaches have been successfully applied to solve many problems in academia and in industry. Although the explosion of big data applications is driving the development of ML, it also imposes severe challenges of data-processing speed and scalability on conventional computer systems. Computing platforms designed specifically for AI applications have been considered, ranging from complements to von Neumann platforms to “must-have” stand-alone technical solutions. These platforms belong to a larger category named “domain-specific computing” and focus on customization for AI. In this article, we summarize recent advances in accelerator designs for deep neural networks (DNNs), that is, DNN accelerators. We discuss various architectures that support DNN execution in terms of computing units, dataflow optimization, targeted network topologies, architectures on emerging technologies, and accelerators for emerging applications. We also provide our vision of future trends in AI chip design.
Extracting amplitude and time information from the shaped pulse is an important step in nuclear physics experiments. For this purpose, a neural network can be an alternative in off-line data processing. To process the data in real time and reduce the off-line data storage required for a trigger event, we designed a customized neural network accelerator on a field-programmable gate array (FPGA) platform to implement specific layers of a convolutional neural network; the accelerator is then used in the front-end electronics of the detector. With fully reconfigurable hardware, a tested neural network structure was used for accurate timing of the shaped pulses common in front-end electronics. The design can handle up to four channels of pulse signals at once, and the peak performance of each channel is 1.665 giga operations per second at a working frequency of 25 MHz.
In recent years, deep learning algorithms have been widely deployed from cloud servers to terminal units, and researchers have proposed various neural network accelerators and software development environments. In this article, we review representative neural network accelerators. Because the accelerator and its software form a whole, the corresponding software stack must take the hardware architecture of the specific accelerator into account to achieve good end-to-end performance. We summarize the programming environments of neural network accelerators and the optimizations in their software stacks. Finally, we comment on future trends in neural network accelerators and programming environments.
As a core component of intelligent edge computing, deep neural networks (DNNs) will play an increasingly important role in addressing intelligence-related issues in industry, such as smart factories and autonomous driving. Due to their large storage and computing requirements, DNNs are unfavorable for resource-constrained edge computing devices, especially mobile terminals with scarce energy supply. Binarization of DNNs has become a promising technology for achieving high performance with low resource consumption in edge computing. Field-programmable gate array (FPGA)-based acceleration can further improve computation efficiency by several times compared with the central processing unit (CPU) and graphics processing unit (GPU). This paper gives a brief overview of binary neural networks (BNNs) and the corresponding hardware accelerator designs for edge computing environments, and analyzes some significant studies in detail. The performance of several methods is evaluated through experimental results, and the latest binarization technologies and hardware acceleration methods are tracked. We first give the background of designing BNNs and present the typical types of BNNs. The FPGA implementation technologies for BNNs are then reviewed. A detailed comparison with experimental evaluation of typical BNNs and their FPGA implementations is further conducted. Finally, interesting directions for future work are outlined.
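The arithmetic that makes BNNs cheap on FPGA fabric is the XNOR-popcount dot product: with +1 encoded as bit 1 and −1 as bit 0, a multiply-accumulate over n binary values collapses into a bitwise XNOR and a popcount. A minimal textbook sketch (not any specific surveyed design):

```python
def bnn_dot(x_bits, w_bits, n):
    """Binarized dot product over n values packed as integers.

    Bits encode +1 as 1 and -1 as 0, so each matching bit contributes +1
    and each mismatch -1: sum = 2 * popcount(XNOR(x, w)) - n.
    """
    mask = (1 << n) - 1
    agree = ~(x_bits ^ w_bits) & mask   # XNOR: 1 where the signs match
    return 2 * bin(agree).count("1") - n
```

On an FPGA the XNOR maps to LUTs and the popcount to a small adder tree, which is why binarization trades so well against DSP-based multiply-accumulate.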
With the rapid development and popularization of artificial intelligence technology, convolutional neural networks (CNNs) are applied in many fields, beginning to replace many traditional algorithms and gradually being deployed on terminal devices. However, the huge data movement and computational complexity of CNNs pose severe power-consumption and performance challenges for hardware, which hinders the application of CNNs in embedded devices such as smartphones and smart cars. This paper implements a convolutional neural network accelerator based on the Winograd convolution algorithm on a field-programmable gate array (FPGA). Firstly, a convolution kernel decomposition method for Winograd convolution is proposed: convolution kernels larger than 3×3 are divided into multiple 3×3 kernels for the convolution operation, and the resulting long convolution operations are processed asynchronously. Then, we design a Winograd convolution array and use configurable multipliers to flexibly realize multiplication for data of different precisions. Experimental results on the VGG16 and AlexNet networks show that our accelerator is the most energy-efficient, achieving 101 times the energy efficiency of the CPU and 5.8 times that of the GPU, as well as higher energy efficiency than other convolutional neural network accelerators.
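As a concrete reference for the tile algorithm behind such 3×3 accelerators, the minimal 1D Winograd transform F(2,3) computes two outputs of a 3-tap convolution with four multiplies instead of six. This is the textbook form of the transform, not the paper's exact implementation:

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap convolution in 4 multiplies.

    d: 4 input samples, g: 3 filter taps. Direct computation needs 6
    multiplies; the transform trades them for extra additions.
    """
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform (precomputed once per kernel in hardware).
    G0 = g0
    G1 = (g0 + g1 + g2) / 2
    G2 = (g0 - g1 + g2) / 2
    G3 = g2
    # The 4 element-wise multiplies.
    m0 = (d0 - d2) * G0
    m1 = (d1 + d2) * G1
    m2 = (d2 - d1) * G2
    m3 = (d1 - d3) * G3
    # Inverse transform back to the output domain.
    return (m0 + m1 + m2, m1 - m2 - m3)
```

The 2D case used for 3×3 kernels nests this transform over rows and columns, which is where the multiplier savings of a Winograd convolution array come from.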
With the continuous development of deep learning, Deep Convolutional Neural Networks (DCNNs) have attracted wide attention in industry due to their high accuracy in image classification. Compared with other DCNN hardware deployment platforms, the Field Programmable Gate Array (FPGA) has the advantages of programmability, low power consumption, parallelism, and low cost. However, the enormous amount of computation in DCNNs and the limited logic capacity of FPGAs restrict the energy efficiency of DCNN accelerators. The traditional sequential sliding-window method can improve the throughput of a DCNN accelerator by data multiplexing, but its multiplexing rate is low because it repeatedly reads the data between rows. This paper proposes a fast data-readout strategy, the circular sliding-window data reading method, which improves the multiplexing rate of data between rows by optimizing the memory-access order of the input data. In addition, the multiplication bit width of a DCNN accelerator is much smaller than that of the Digital Signal Processing (DSP) blocks on the FPGA, so a multiplication that occupies a whole DSP wastes resources. A multiplier-sharing strategy is therefore proposed: the accelerator's multipliers are customized so that a single DSP block can complete multiple groups of 4-, 6-, and 8-bit signed multiplications in parallel. Finally, based on these two strategies, an FPGA-optimized accelerator is proposed. The accelerator is implemented in Verilog and deployed on a Xilinx VCU118. When recognizing the CIFAR-10 dataset, its energy efficiency is 39.98 GOPS/W, a 1.73× improvement over previous DCNN FPGA accelerators; when recognizing the ImageNet dataset, its energy efficiency is 41.12 GOPS/W, which is 1.28×–3.14× that of other designs.
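The DSP-sharing idea rests on packing several narrow factors into one wide operand with enough guard bits that the partial products cannot overlap, so one wide multiply yields several independent narrow products. A software sketch for the unsigned case follows (the signed variant the paper needs requires extra correction terms not shown here):

```python
def packed_mul(a, b1, b2, width=8):
    """Two unsigned multiplies sharing one wide multiplication.

    Computes a*b1 and a*b2 (all 'width'-bit unsigned) with a single
    multiply: b1 and b2 are packed into one operand separated by a
    2*width-bit gap, since each product a*b_i fits in 2*width bits.
    This mirrors, in software, a wide FPGA DSP slice carrying several
    narrow products at once.
    """
    assert a < (1 << width) and b1 < (1 << width) and b2 < (1 << width)
    gap = 2 * width               # guard distance between the two products
    packed = (b1 << gap) | b2     # one wide operand holding both factors
    p = a * packed                # the single shared multiply
    lo = p & ((1 << gap) - 1)     # recovers a*b2
    hi = p >> gap                 # recovers a*b1
    return hi, lo
```

The narrower the operands (4- or 6-bit), the more factors fit into the DSP's wide port, which is exactly the under-utilization the multiplier-sharing strategy reclaims.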
With increasing data and model sizes, deep neural networks (DNNs) show outstanding performance in many artificial intelligence (AI) applications. But the large model size makes high-performance, low-power DNN execution a challenge on processors such as the central processing unit (CPU), graphics processing unit (GPU), and tensor processing unit (TPU). This paper proposes LOGNN, an 8-bit data representation, and LACC, a hardware/software co-designed deep neural network accelerator, to meet this challenge. The LOGNN representation replaces the multiply operations in DNN execution with add and shift operations. The LACC accelerator achieves higher efficiency than state-of-the-art DNN accelerators through domain-specific arithmetic units. Overall, LACC improves performance per watt by 1.5 times on average compared with state-of-the-art DNN accelerators.
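The general mechanism behind turning multiplies into shifts is log-domain quantization: store each weight as a sign and an exponent, so multiplying by it is an exponent shift. The sketch below shows the generic idea only; the paper's actual 8-bit LOGNN encoding is not specified here:

```python
import math

def log_quantize(w):
    """Quantize a nonzero weight to the nearest signed power of two.

    Returns (sign, exponent) so that w is approximated by sign * 2**exponent.
    This is a generic log-domain scheme, not the paper's exact format.
    """
    sign = -1.0 if w < 0 else 1.0
    e = round(math.log2(abs(w)))   # nearest power-of-two exponent
    return sign, e

def log_mul(x, sign, e):
    """Multiply x by the quantized weight without a hardware multiplier:
    scaling by 2**e is a binary shift (ldexp), plus a sign flip."""
    return sign * math.ldexp(x, e)   # x * 2**e
```

In fixed-point hardware `ldexp` becomes a barrel shifter and the accumulation stays an ordinary adder, which is what lets log-domain accelerators drop multipliers entirely.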
With the development of computer vision research, deep neural networks (DNNs) have been widely applied in various applications (autonomous vehicles, weather forecasting, counter-terrorism, surveillance, traffic management, etc.) due to their state-of-the-art performance on image and video processing tasks. However, to achieve such performance, DNN models have become increasingly complicated and deeper, resulting in heavy computational stress. General-purpose central processing unit (CPU) processors are therefore insufficient to meet real-time application requirements. To deal with this bottleneck, research on hardware acceleration solutions for DNNs has attracted great attention; such solutions mainly address hardware acceleration under intense memory and computation demands. In this paper, a novel resource-saving architecture based on a Field Programmable Gate Array (FPGA) is proposed. Owing to the newly designed processing element (PE), the proposed architecture achieves good performance with extremely limited computing resources. The on-chip buffer allocation further enhances resource savings on memory. Moreover, the accelerator improves its performance by exploiting the sparsity of the input feature map. Compared with other state-of-the-art FPGA-based solutions, our architecture achieves good performance with quite limited resource consumption, fully meeting the requirements of real-time applications.
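Exploiting feature-map sparsity typically means gating the multiplier on zero activations, so zeros (abundant after ReLU) cost no multiply cycles. A minimal software analogue of such a zero-skipping multiply-accumulate (hypothetical, not the paper's PE design):

```python
def sparse_mac(features, weights):
    """Zero-skipping multiply-accumulate.

    Only nonzero activations issue a multiply; returns the dot product
    and the number of multiplies actually performed, so the saving from
    sparsity is directly visible.
    """
    acc = 0.0
    mults = 0
    for f, w in zip(features, weights):
        if f != 0:          # gate the multiplier on zero inputs
            acc += f * w
            mults += 1
    return acc, mults
```

In hardware the same predicate drives clock- or operand-gating of the PE, turning activation sparsity into direct energy and cycle savings.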
For training present-day Neural Network (NN) models, the standard technique is to utilize decaying Learning Rates (LR). While the majority of these techniques commence with a large LR, they decay it multiple times over the course of training; decaying has been proven to enhance both generalization and optimization. Other parameters, such as the network's size, the number of hidden layers, dropout to avoid overfitting, batch size, and so on, are chosen solely by heuristics. This work proposes an Adaptive Teaching Learning Based (ATLB) heuristic to identify the optimal hyperparameters for diverse networks. Here we consider three deep neural network architectures for classification: Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), and Bidirectional Long Short-Term Memory (BiLSTM). The proposed ATLB is evaluated with various learning-rate schedulers: Cyclical Learning Rate (CLR), Hyperbolic Tangent Decay (HTD), and Toggle between Hyperbolic Tangent Decay and Triangular mode with Restarts (T-HTR). Experimental results show performance improvements on the 20Newsgroups, Reuters Newswire, and IMDB datasets.
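Of the schedulers named above, CLR's triangular policy is simple enough to state in a few lines: the LR ramps linearly between a base and a maximum value, completing a full cycle every 2*step_size steps. The parameter values below are illustrative defaults, not those used in the paper:

```python
def cyclical_lr(step, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    """Triangular cyclical learning rate (Smith-style CLR policy).

    The LR rises linearly from base_lr to max_lr over step_size steps,
    then falls back symmetrically; one full cycle is 2*step_size steps.
    """
    cycle = step // (2 * step_size)            # which cycle we are in
    x = abs(step / step_size - 2 * cycle - 1)  # goes 1 -> 0 -> 1 in a cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```

HTD and T-HTR replace the linear ramp with a hyperbolic-tangent profile, but the schedule is queried the same way, once per optimizer step.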
Deep learning is now widely used in the intelligent apps of mobile devices. In pursuit of ultra-low power and latency, integrating neural network accelerators (NNAs) into mobile phones has become a trend. However, conventional deep learning programming frameworks are not well developed to support such devices, leading to low computing efficiency and high memory occupation. To address this problem, a 2-stage pipeline is proposed for optimizing deep learning model inference on mobile devices with NNAs in terms of both speed and memory footprint. The first stage reduces the computation workload via graph optimization, including splitting and merging nodes. The second stage goes further by optimizing at the compilation level, including kernel fusion and ahead-of-time compilation. The proposed optimizations are evaluated on a commercial mobile phone with an NNA. The experimental results show that the proposed approaches achieve 2.8× to 26× speedup and reduce the memory footprint by up to 75%.
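Node merging of the kind the first stage performs can be sketched as a tiny pass over a topologically ordered graph: wherever a compute op feeds exactly one elementwise op (e.g. conv followed by relu), the pair collapses into one fused node, eliminating an intermediate tensor. The graph encoding below is an assumption made for the example, not the pipeline's actual IR:

```python
def fuse_elementwise(graph):
    """Toy graph-fusion pass: merge op -> single elementwise consumer.

    graph: list of (op, input_index) nodes in topological order, where
    input_index is the position of the producing node (None for inputs).
    Returns a new node list with fused pairs like ("conv+relu", src).
    """
    ELEMENTWISE = {"relu", "add_bias"}
    consumers = {}
    for i, (_, src) in enumerate(graph):
        if src is not None:
            consumers.setdefault(src, []).append(i)
    fused, remap, skip = [], {}, set()
    for i, (op, src) in enumerate(graph):
        if i in skip:
            continue
        new_src = remap.get(src, src)          # re-point at fused indices
        nxt = consumers.get(i, [])
        if len(nxt) == 1 and graph[nxt[0]][0] in ELEMENTWISE:
            j = nxt[0]
            fused.append((op + "+" + graph[j][0], new_src))
            skip.add(j)                        # consumer is absorbed
            remap[j] = len(fused) - 1
        else:
            fused.append((op, new_src))
        remap[i] = len(fused) - 1
    return fused
```

The second stage's kernel fusion applies the same merging logic at code-generation time, emitting one accelerator kernel for the fused node instead of two.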
Funding (transfer-learning DSE for accelerators): the National Key R&D Program of China (No. 2018AAA0103300); the National Natural Science Foundation of China (Nos. 61925208, U20A20227, U22A2028); the Chinese Academy of Sciences Project for Young Scientists in Basic Research (No. YSBR-029); and the Youth Innovation Promotion Association, Chinese Academy of Sciences.
Funding (accelerated tooth movement meta-analysis): the Hainan Provincial Finance Fund for Science and Technology Program, 2020 Hainan Province Key R&D Program for Social Development (No. ZDYF2020166); and the 2023 Hainan Province Key R&D Program for Social Development (No. ZDYF2023SHFZ095).
Funding (dual-mode distributed memory architecture): the National Key R&D Program of China (No. 2022ZD0119001); the National Natural Science Foundation of China (Nos. 61834005, 61802304); the Education Department of Shaanxi Province (No. 22JY060); and the Shaanxi Provincial Key Research and Development Plan (No. 2024GX-YBXM-100).
Funding (DNN accelerator survey): the National Science Foundation (NSF) (1822085, 1725456, 1816833, 1500848, 1719160, and 1725447); the NSF Computing and Communication Foundations (1740352); the Nanoelectronics COmputing REsearch Program of the Semiconductor Research Corporation (NC-2766-A); and the Center for Research in Intelligent Storage and Processing-in-Memory, one of six centers in the Joint University Microelectronics Program, an SRC program sponsored by the Defense Advanced Research Projects Agency.
Funding (neural network pulse timing): the National Natural Science Foundation of China (Nos. 11875146 and 11505074); and the National Key Research and Development Program of China (No. 2016YFE0100900).
Funding (accelerator programming environments survey): the National Key Research and Development Program of China (Grants 2017YFB1003101, 2018AAA0103300, 2017YFA0700900, 2017YFA0700902, 2017YFA0700901); the National Natural Science Foundation of China (Grants 61732007, 61432016, 61532016, 61672491, 61602441, 61602446, 61732002, 61702478, and 61732020); the Beijing Natural Science Foundation (JQ18013); the National Science and Technology Major Project (2018ZX01031102); the Transformation and Transfer of Scientific and Technological Achievements of the Chinese Academy of Sciences (KFJ-HGZX-013); Key Research Projects in Frontier Science of the Chinese Academy of Sciences (QYZDBSSW-JSC001); the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB32050200, XDC01020000); the Standardization Research Project of the Chinese Academy of Sciences (BZ201800001); the Beijing Academy of Artificial Intelligence (BAAI); and the Beijing Nova Program of Science and Technology (Z191100001119093).
Funding (BNN accelerator survey): the Natural Science Foundation of Sichuan Province of China (Grant No. 2022NSFSC0500); and the National Natural Science Foundation of China (Grant No. 62072076).
Funding (Winograd CNN accelerator): a 2022 project of the State Grid Corporation of China (No. 5700-201941501A-0-0-00); and the National Natural Science Foundation of China (No. U21B2031).
Funding (DCNN FPGA accelerator): the Major Program of the Ministry of Science and Technology of China (Grant 2019YFB2205102); and the National Natural Science Foundation of China (Grants 61974164, 62074166, 61804181, 62004219, 62004220, 62104256).
Fund: Supported by the National Key Research and Development Program of China (Nos. 2018AAA0103300, 2017YFA0700900, 2017YFA0700902, 2017YFA0700901, 2019AAA0103802, 2020AAA0103802).
Abstract: With the increase of data size and model size, deep neural networks (DNNs) show outstanding performance in many artificial intelligence (AI) applications. But the large model size makes high-performance, low-power DNN execution a challenge on processors such as the central processing unit (CPU), graphics processing unit (GPU), and tensor processing unit (TPU). This paper proposes LOGNN, an 8-bit logarithmic data representation, and LACC, a hardware/software co-designed deep neural network accelerator, to meet this challenge. The LOGNN representation replaces multiply operations with add and shift operations when running DNNs. The LACC accelerator achieves higher efficiency than state-of-the-art DNN accelerators through domain-specific arithmetic computing units. Finally, LACC improves performance per watt by 1.5× on average compared with state-of-the-art DNN accelerators.
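The core idea of a logarithmic representation, replacing multiplies with adds and shifts, can be sketched in a few lines. This is a minimal illustration assuming weights are quantized to signed powers of two; it is not the paper's LOGNN format, and the function names are hypothetical.

```python
import math

def quantize_pow2(w):
    """Round a weight to the nearest signed power of two: a toy
    stand-in for an 8-bit logarithmic representation that stores
    only a sign and an exponent."""
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    return sign, exp

def shift_dot(xs, ws):
    """Dot product on integer activations using only adds and
    shifts: each multiply x*w becomes a shift of x by w's exponent."""
    acc = 0
    for x, w in zip(xs, ws):
        sign, exp = quantize_pow2(w)
        if sign:
            term = x << exp if exp >= 0 else x >> -exp
            acc += sign * term
    return acc

print(shift_dot([3, 5, 2], [2.0, -4.0, 1.0]))  # -12 (exact: weights are powers of two)
```

In hardware this removes the multiplier array entirely; accuracy then depends on how well the weight distribution tolerates power-of-two rounding.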
Abstract: With the development of computer vision research, deep neural networks (DNNs) have been widely applied in various applications (autonomous vehicles, weather forecasting, counter-terrorism, surveillance, traffic management, etc.) owing to their state-of-the-art performance on image and video processing tasks. However, to achieve such performance, DNN models have become increasingly complicated and deeper, resulting in heavy computational stress. General-purpose central processing units (CPUs) are therefore no longer sufficient to meet real-time application requirements. To deal with this bottleneck, research on hardware acceleration for DNNs has attracted great attention; in particular, DNN acceleration solutions must cope with intense memory and computation demands. In this paper, a novel resource-saving architecture based on the Field Programmable Gate Array (FPGA) is proposed. Owing to a newly designed processing element (PE), the proposed architecture achieves good performance with extremely limited computing resources. The on-chip buffer allocation further enhances resource savings on memory. Moreover, the accelerator improves its performance by exploiting the sparsity of the input feature map. Compared with other state-of-the-art FPGA-based solutions, the proposed architecture achieves good performance with quite limited resource consumption, thus fully meeting the requirements of real-time applications.
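Exploiting feature-map sparsity typically means storing only nonzero activations and skipping the multiply-accumulates on zeros. The following Python sketch shows the idea on a single feature-map row; the compressed format and function names are illustrative assumptions, not the paper's actual encoding.

```python
def to_sparse(fmap_row):
    """Compress a feature-map row into (index, value) pairs,
    dropping zeros so the MACs on them can be skipped entirely."""
    return [(i, v) for i, v in enumerate(fmap_row) if v != 0]

def sparse_dot(sparse_fmap, weights):
    """Accumulate only over the nonzero activations."""
    return sum(v * weights[i] for i, v in sparse_fmap)

fmap = [0, 3, 0, 0, 5, 0, 1, 0]     # ReLU outputs are often mostly zero
weights = [2, 4, 6, 8, 1, 3, 5, 7]
s = to_sparse(fmap)                 # 3 of 8 MACs remain
print(sparse_dot(s, weights))       # 22
```

With ReLU activations, sparsity of 50% or more is common, so skipping zeros can cut both arithmetic work and on-chip buffer traffic substantially.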
Abstract: For training present-day Neural Network (NN) models, the standard technique is to use decaying Learning Rates (LR). While the majority of these techniques start with a large LR, they decay it multiple times over the course of training. Decaying has been shown to improve both generalization and optimization. Other parameters, such as the network size, the number of hidden layers, dropout to avoid overfitting, batch size, and so on, are chosen solely by heuristics. This work proposes an Adaptive Teaching Learning Based (ATLB) heuristic to identify the optimal hyperparameters for diverse networks. Three deep neural network architectures are considered for classification: Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), and Bidirectional Long Short-Term Memory (BiLSTM). The proposed ATLB is evaluated with various learning-rate schedulers: Cyclical Learning Rate (CLR), Hyperbolic Tangent Decay (HTD), and Toggle between Hyperbolic Tangent Decay and Triangular mode with Restarts (T-HTR). Experimental results show performance improvements on the 20Newsgroup, Reuters Newswire, and IMDB datasets.
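Of the schedulers named above, the Cyclical Learning Rate in its standard triangular form (Smith, 2017) is compact enough to sketch directly. This is a generic CLR implementation, not the ATLB heuristic itself; the parameter names follow common usage and are assumptions here.

```python
import math

def triangular_clr(it, base_lr, max_lr, step_size):
    """Cyclical learning rate, triangular policy: the LR sweeps
    linearly from base_lr up to max_lr and back down once every
    2*step_size iterations, instead of decaying monotonically."""
    cycle = math.floor(1 + it / (2 * step_size))
    x = abs(it / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

# One full cycle with step_size=100: base -> max -> base.
for it in (0, 50, 100, 150, 200):
    print(it, round(triangular_clr(it, 0.001, 0.01, 100), 5))
```

Schedulers like this expose their own hyperparameters (base_lr, max_lr, step_size), which is precisely the kind of search space a heuristic such as ATLB is meant to navigate.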
Fund: Supported by the National Key Research and Development Program of China (Nos. 2017YFB1003101, 2018AAA0103300, 2017YFA0700900), the National Natural Science Foundation of China (Nos. 61702478, 61732007, 61906179), the Beijing Natural Science Foundation (No. JQ18013), the National Science and Technology Major Project (No. 2018ZX01031102), and the Beijing Academy of Artificial Intelligence.
Abstract: Deep learning is now widely used in intelligent apps on mobile devices. In pursuit of ultra-low power and latency, integrating neural network accelerators (NNA) into mobile phones has become a trend. However, conventional deep learning programming frameworks are not well developed to support such devices, leading to low computing efficiency and high memory occupation. To address this problem, a two-stage pipeline is proposed for optimizing deep learning model inference on mobile devices with NNAs, in terms of both speed and memory footprint. The first stage reduces the computation workload via graph optimization, including splitting and merging nodes. The second stage goes further by optimizing at the compilation level, including kernel fusion and ahead-of-time compilation. The proposed optimizations are evaluated on a commercial mobile phone with an NNA. The experimental results show that the proposed approaches achieve 2.8× to 26× speedup and reduce the memory footprint by up to 75%.
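Graph-level node merging can be illustrated with one common pattern: collapsing a run of elementwise scale and shift nodes (e.g., left over from batch-norm folding) into a single affine node, so the runtime launches one kernel instead of several. This is a hypothetical sketch of the general technique, not the paper's pipeline; the node encoding is invented for illustration.

```python
def merge_affine(nodes):
    """Collapse consecutive elementwise ('scale', s) and
    ('shift', b) nodes into one ('affine', a, b) node computing
    a*x + b. Other node kinds pass through unchanged."""
    merged = []
    a, b = 1.0, 0.0
    pending = False
    for op, v in nodes:
        if op == "scale":
            a, b = a * v, b * v      # (a*x + b) * v
            pending = True
        elif op == "shift":
            b += v                   # (a*x + b) + v
            pending = True
        else:
            if pending:
                merged.append(("affine", a, b))
                a, b, pending = 1.0, 0.0, False
            merged.append((op, v))
    if pending:
        merged.append(("affine", a, b))
    return merged

# scale*2, +3, scale*4  ==  8*x + 12, now a single node
print(merge_affine([("scale", 2.0), ("shift", 3.0), ("scale", 4.0)]))
```

Kernel fusion at the compilation stage applies the same idea one level lower: emitting one fused kernel for the merged node rather than materializing intermediates between separate kernels.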