With the increasing demand for computational power in artificial intelligence (AI) algorithms, dedicated accelerators have become a necessity. However, the complexity of hardware architectures, the vast design search space, and the complex tasks of accelerators pose significant challenges. Traditional search methods can become prohibitively slow as the search space continues to expand. A design space exploration (DSE) method based on transfer learning is proposed, which reduces the time spent on repeated training and uses multi-task models for different tasks on the same processor. The proposed method accurately predicts the latency and energy consumption associated with neural network accelerator design parameters, enabling faster identification of optimal outcomes than traditional methods. Compared with other DSE methods that use a multilayer perceptron (MLP), it also requires shorter training time. Comparative experiments demonstrate that the proposed method improves the efficiency of DSE without compromising the accuracy of the results.
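A minimal sketch of such predictor-guided DSE, assuming a toy design space, a stand-in cost model, and a nearest-neighbour surrogate in place of the paper's trained MLP (all parameter names and numbers here are illustrative, not from the paper):

```python
import itertools
import random

# Hypothetical accelerator design space; the parameter names are assumptions.
DESIGN_SPACE = {
    "pe_rows": [4, 8, 16],
    "pe_cols": [4, 8, 16],
    "buffer_kb": [64, 128, 256],
}

def all_configs(space):
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def simulate(cfg):
    """Stand-in for a slow cycle-accurate simulation -> (latency, energy)."""
    pes = cfg["pe_rows"] * cfg["pe_cols"]
    latency = 1e6 / pes + cfg["buffer_kb"] * 0.5
    energy = pes * 2.0 + cfg["buffer_kb"] * 0.1
    return latency, energy

def predictor_guided_search(space, budget=5):
    """Simulate only `budget` sampled configs, then rank every config with a
    cheap surrogate (1-nearest-neighbour over the simulated samples) -- the
    low-cost analogue of the paper's MLP latency/energy predictor."""
    random.seed(0)
    configs = list(all_configs(space))
    sampled = random.sample(configs, budget)
    table = {tuple(sorted(c.items())): simulate(c) for c in sampled}

    def surrogate(cfg):
        # Predict with the closest simulated config.
        key = min(table, key=lambda k: sum(abs(dict(k)[p] - cfg[p]) for p in cfg))
        return table[key]

    # Pick the config with the best predicted latency + energy.
    return min(configs, key=lambda c: sum(surrogate(c)))

best = predictor_guided_search(DESIGN_SPACE)
```

Only `budget` slow simulations are run; the surrogate ranks the remaining configurations, which is where the claimed speedup over exhaustive search comes from.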
With the rapid development of deep learning algorithms, their computational complexity and functional diversity are increasing rapidly. However, the gap between high computational density and insufficient memory bandwidth under the traditional von Neumann architecture is widening. Analyzing the algorithmic characteristics of convolutional neural networks (CNNs), it is found that the memory access characteristics of convolution (CONV) and fully connected (FC) operations are very different. Based on this observation, a dual-mode reconfigurable distributed memory architecture for CNN accelerators is designed. It can be configured in Bank mode or first-in first-out (FIFO) mode to accommodate the access needs of different operations. At the same time, a programmable memory control unit is designed, which effectively controls the dual-mode configurable distributed memory architecture through customized special access instructions and reduces data access delay. The proposed architecture is verified and tested by parallel implementation of several CNN algorithms. The experimental results show that the peak bandwidth reaches 13.44 GB·s^(-1) at an operating frequency of 120 MHz, which is 1.40, 1.12, 2.80, and 4.70 times the peak bandwidth of existing works.
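The Bank/FIFO dual-mode idea can be sketched behaviorally; the class below is a software toy that mimics the two access patterns (random access for CONV tiles, streaming for FC weights), and its API is an assumption rather than the hardware interface:

```python
from collections import deque

class DualModeMemory:
    """Toy model of a dual-mode buffer: Bank mode offers random access
    (as CONV operations need), FIFO mode offers streaming order (as FC
    weight streams need). Size and method names are illustrative."""

    def __init__(self, size):
        self.size = size
        self.mode = "bank"
        self.banked = [0] * size   # Bank mode storage
        self.fifo = deque()        # FIFO mode storage

    def configure(self, mode):
        # Analogue of the accelerator's mode-configuration instruction.
        assert mode in ("bank", "fifo")
        self.mode = mode

    def write(self, value, addr=None):
        if self.mode == "bank":
            self.banked[addr] = value   # random-access write (CONV)
        else:
            self.fifo.append(value)     # streaming write (FC)

    def read(self, addr=None):
        if self.mode == "bank":
            return self.banked[addr]    # random-access read
        return self.fifo.popleft()      # in-order read
```

The point of the sketch is only that one physical buffer can serve both access disciplines once a mode bit selects the addressing logic.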
A method for reducing noise radiated from structures by vibration absorbers is presented. Since the usual design method for the absorbers is invalid for noise reduction, the peaks of noise power in the frequency domain are applied as cost functions. Hence, the equations for obtaining optimal parameters of the absorbers become nonlinear expressions. To obtain the parameters, an accelerated neural network procedure is presented. Numerical calculations have been carried out for a plate-type cantilever beam with a large width, and experimental tests have been performed on the same beam. It is clarified that the present method is valid for reducing noise radiated from structures. In the usual design method for the absorbers, modal analysis is applied, so the number of absorbers must equal the number of considered modes. Since the present method can deal with the nonlinear problem directly, there is no restriction on the number of absorbers or modes.
In recent years, neural networks (NNs) have received increasing attention from both academia and industry. The significant diversity among existing NNs, as well as among their hardware platforms, makes NN programming a daunting task. In this paper, a domain-specific language (DSL) for NNs, the neural network language (NNL), is proposed to deliver both productive NN programming and portable NN execution performance on different hardware platforms. The productivity and flexibility of NN programming are enabled by abstracting NNs as directed graphs of blocks. Four representative and widely used NNs are described in the language and run on three different hardware platforms (CPU, GPU, and an NN accelerator). Experimental results show that NNs written in the proposed language perform, on average, 14.5% better than the baseline implementations across the three platforms. Moreover, compared with the Caffe framework, which specifically targets the GPU platform, the code achieves similar performance.
In recent years, deep learning algorithms have been widely deployed, from cloud servers to terminal units, and researchers have proposed various neural network accelerators and software development environments. In this article, we review representative neural network accelerators. Because hardware and software act as a whole, the corresponding software stack must take the hardware architecture of the specific accelerator into account to achieve good end-to-end performance. We summarize the programming environments of neural network accelerators and the optimizations in their software stacks. Finally, we comment on future trends of neural network accelerators and their programming environments.
Deep learning is now widely used in intelligent apps on mobile devices. In pursuit of ultra-low power and latency, integrating neural network accelerators (NNAs) into mobile phones has become a trend. However, conventional deep learning programming frameworks are not well developed to support such devices, leading to low computing efficiency and high memory occupation. To address this problem, a two-stage pipeline is proposed for optimizing deep learning model inference on mobile devices with NNAs, in terms of both speed and memory footprint. The first stage reduces computation workload via graph optimization, including splitting and merging nodes. The second stage goes further by optimizing at the compilation level, including kernel fusion and in-advance compilation. The proposed optimizations are evaluated on a commercial mobile phone with an NNA. The experimental results show that the proposed approaches achieve 2.8× to 26× speedup and reduce the memory footprint by up to 75%.
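The "merging nodes" step of the first stage can be illustrated on a toy linear graph; the op names and the rule of fusing runs of adjacent elementwise ops are illustrative assumptions, not the paper's exact pass:

```python
# Ops assumed (for illustration) to be cheap elementwise nodes that a graph
# optimizer can merge into one fused kernel launch.
ELEMENTWISE = {"relu", "add_bias", "scale"}

def merge_elementwise(nodes):
    """Merge runs of adjacent elementwise ops in a linear op chain into a
    single fused node, reducing per-node dispatch overhead."""
    fused, run = [], []
    for op in nodes:
        if op in ELEMENTWISE:
            run.append(op)            # extend the current fusible run
        else:
            if run:                   # close the run before a heavy op
                fused.append("fused(" + "+".join(run) + ")")
                run = []
            fused.append(op)
    if run:                           # flush a trailing run
        fused.append("fused(" + "+".join(run) + ")")
    return fused

graph = ["conv", "add_bias", "relu", "conv", "scale", "relu", "pool"]
optimized = merge_elementwise(graph)
# optimized == ["conv", "fused(add_bias+relu)", "conv", "fused(scale+relu)", "pool"]
```

Seven nodes become five, which is the shape of the workload reduction the graph-optimization stage aims for.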
At 4:50 on April 30, China's LM-3B/I rocket, an improved type based on the LM-3B, made its debut at the Xichang Satellite Launch Center and successfully sent the 12th and 13th BeiDou Navigation Satellite System satellites into the planned transfer orbit. It was the first time that China launched two BeiDou satellites with one rocket.
The Web offers a very convenient way to access remote information resources, and an important measure of Web service quality is how long it takes to search for and retrieve information. By caching a Web server's dynamic content, repeated database queries can be avoided and the access frequency of the original resources reduced, thus improving the speed of the server's response. This paper describes the concept, advantages, principles, and concrete realization procedure of a dynamic-content cache module for a Web server. Key words: dynamic content caching; network acceleration; Apache module. CLC number: TP 393.09. Foundation item: supported by the Science Committee of Wuhan. Biography: LIU Dan (1980-), male, Master candidate; research directions: high-speed computer networks, high-performance server cluster systems.
The warehouse environment parameter monitoring system is designed to avoid the complex networking and high cost of traditional monitoring systems. A sensor error correction model that combines particle swarm optimization (PSO) with the back-propagation (BP) neural network algorithm is established to reduce nonlinear characteristics and improve the test accuracy of the system. Simulation and experiments indicate that the PSO-BP neural network algorithm has the advantages of a fast convergence rate and high diagnostic accuracy. The monitoring system provides higher measurement precision, lower power consumption, stable network data communication, and fault diagnosis functions. The system has been applied to monitoring the environment parameters of warehouses, special vehicles, ships, etc.
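A hedged sketch of the PSO half of the PSO-BP idea: here PSO fits a small quadratic correction curve to hypothetical calibration data, standing in for the BP network whose weights it tunes in the actual system. The calibration pairs and PSO coefficients are illustrative assumptions:

```python
import random

# Hypothetical calibration pairs: raw (nonlinear) sensor reading -> true value.
CALIBRATION = [(0.0, 0.0), (0.25, 0.21), (0.5, 0.45), (0.75, 0.72), (1.0, 1.0)]

def correction(raw, w):
    # Quadratic correction curve; a stand-in for the BP network.
    a, b, c = w
    return a * raw * raw + b * raw + c

def mse(w):
    # Fitness: mean squared correction error over the calibration set.
    return sum((correction(x, w) - y) ** 2 for x, y in CALIBRATION) / len(CALIBRATION)

def pso(dim=3, swarm=20, iters=100):
    """Standard PSO: each particle tracks its personal best; the swarm
    shares a global best; velocities blend inertia with both attractors."""
    random.seed(1)
    pos = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(swarm)]
    vel = [[0.0] * dim for _ in range(swarm)]
    pbest = [p[:] for p in pos]
    gbest = min(pbest, key=mse)
    for _ in range(iters):
        for i in range(swarm):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if mse(pos[i]) < mse(pbest[i]):
                pbest[i] = pos[i][:]
        gbest = min(pbest, key=mse)
    return gbest

weights = pso()
```

In the real system the particle would encode BP-network weights rather than three polynomial coefficients; the swarm dynamics are unchanged.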
Graph convolutional neural networks (GCNs) have emerged as an effective approach to extending deep learning to graph data analytics, but they are computationally challenging given the irregular graphs and the large number of nodes in a graph. GCNs involve chained sparse-dense matrix multiplications with six loops, which results in a large design space for GCN accelerators. Prior work on GCN acceleration either employs limited loop optimization techniques or determines the design variables by random sampling, which can hardly exploit data reuse efficiently, thus degrading system efficiency. To overcome this limitation, this paper proposes GShuttle, a GCN acceleration scheme that maximizes memory access efficiency to achieve high performance and energy efficiency. GShuttle systematically explores loop optimization techniques for GCN acceleration and quantitatively analyzes the design objectives (e.g., required DRAM accesses and SRAM accesses) by analytical calculation over multiple design variables. GShuttle further employs two approaches, pruned search-space sweeping and greedy search, to find the optimal design variables under given design constraints. We demonstrate the efficacy of GShuttle by evaluation on five widely used graph datasets. The experimental simulations show that GShuttle reduces the number of DRAM accesses by a factor of 1.5 and saves energy by a factor of 1.7 compared with state-of-the-art approaches.
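The pruned-sweep and greedy-search idea can be illustrated on a deliberately simplified analytical model; the cost function, candidate tile sizes, and SRAM budget below are illustrative assumptions, not GShuttle's actual model:

```python
import itertools

# Illustrative problem: pick output-tile sizes (tm, tn) for a dense stage
# of the chained matrix multiplication, under an on-chip SRAM budget.
M, N, SRAM_WORDS = 1024, 1024, 4096
CAND = [32, 64, 128, 256]  # candidate tile sizes (assumed powers of two)

def cost(tm, tn):
    # Toy DRAM-access model: larger tiles in either dimension mean
    # fewer re-reads of the corresponding operand.
    return M * N // tm + M * N // tn

def feasible(tm, tn):
    return tm * tn <= SRAM_WORDS  # tile must fit on chip

def sweep():
    """Pruned exhaustive sweep: enumerate only feasible design points."""
    return min((p for p in itertools.product(CAND, CAND) if feasible(*p)),
               key=lambda p: cost(*p))

def greedy():
    """Greedy search: repeatedly grow whichever tile dimension most
    reduces the modeled DRAM accesses while staying feasible."""
    tm, tn = CAND[0], CAND[0]
    while True:
        moves = [(t, n) for (t, n) in ((tm * 2, tn), (tm, tn * 2))
                 if t in CAND and n in CAND
                 and feasible(t, n) and cost(t, n) < cost(tm, tn)]
        if not moves:
            return tm, tn
        tm, tn = min(moves, key=lambda p: cost(*p))
```

On this toy model both strategies land on the same balanced tiling; greedy evaluates far fewer points, which is the trade-off the paper quantifies at full scale.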
Dynamic neural network (NN) techniques are increasingly important because they facilitate deep learning with more complex network architectures. However, existing studies, which predominantly optimize static computational graphs with static scheduling methods, focus on optimizing static neural networks in deep neural network (DNN) accelerators. We analyze the execution process of dynamic neural networks and observe that their dynamic features introduce challenges for efficient scheduling and pipelining in existing DNN accelerators. We propose DyPipe, a holistic approach to optimizing dynamic neural network inference in enhanced DNN accelerators. DyPipe achieves significant performance improvements for dynamic neural networks while introducing negligible overhead for static neural networks. Our evaluation demonstrates that DyPipe achieves a 1.7× speedup on dynamic neural networks and maintains more than 96% of the performance for static neural networks.
Deep neural networks have evolved remarkably over the past few years, and they are currently the fundamental tools of many intelligent systems. At the same time, the computational complexity and resource consumption of these networks continue to increase. This poses a significant challenge to the deployment of such networks, especially in real-time applications or on resource-limited devices. Thus, network acceleration has become a hot topic within the deep learning community. As for hardware implementation of deep neural networks, a batch of accelerators based on field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) has been proposed in recent years. In this paper, we provide a comprehensive survey of recent advances in network acceleration, compression, and accelerator design from both the algorithm and hardware points of view. Specifically, we provide a thorough analysis of each of the following topics: network pruning, low-rank approximation, network quantization, teacher-student networks, compact network design, and hardware accelerators. Finally, we introduce and discuss a few possible future directions.
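Two of the surveyed techniques, magnitude pruning and uniform quantization, admit very short sketches; these are generic textbook versions, not any specific accelerator's scheme:

```python
def prune(weights, sparsity):
    """Magnitude pruning: zero out the smallest-|w| weights until the
    requested fraction of the weights is zero."""
    k = int(len(weights) * sparsity)
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else float("-inf")
    return [0.0 if abs(w) <= threshold else w for w in weights]

def quantize(weights, bits=8):
    """Symmetric uniform quantization to signed `bits`-bit integers.
    Returns the integer codes and the scale needed to dequantize."""
    scale = max(abs(w) for w in weights) / (2 ** (bits - 1) - 1)
    return [round(w / scale) for w in weights], scale
```

Pruning trades accuracy for sparsity that hardware can skip; quantization shrinks every weight to a narrow integer plus one shared scale, which is why the two compose well in accelerator designs.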
Convolutional Neural Networks (CNNs) are widely used in computer vision, natural language processing, and so on, and real applications generally require low power and high efficiency. Thus, energy efficiency has become a critical indicator of CNN accelerators. Considering that asynchronous circuits have the advantages of low power consumption, high speed, and freedom from clock distribution problems, we design and implement an energy-efficient asynchronous CNN accelerator in a 65 nm Complementary Metal Oxide Semiconductor (CMOS) process. Given the absence of a commercial design tool flow for asynchronous circuits, we develop a novel design flow that efficiently implements Click-based asynchronous bundled-data circuits down to mask layout with conventional Electronic Design Automation (EDA) tools. We also introduce an adaptive delay matching method and perform accurate static timing analysis of the circuits to ensure correct timing. An accelerator for a handwriting recognition network (the LeNet-5 model) is implemented. Silicon test results show that the asynchronous accelerator consumes 30% less power in its computing array than the synchronous one, and that its energy efficiency reaches 1.538 TOPS/W, 12% higher than that of the synchronous chip.
Uniform-memory multicore neural network accelerators (UNNAs) furnish huge computing power to emerging neural network applications. Meanwhile, with neural network architectures going deeper and wider, the limited memory capacity has become a constraint on deploying models on UNNA platforms. Therefore, efficiently managing memory space and reducing workload footprints are urgent concerns. In this paper, we propose Tetris, a heuristic static memory management framework for UNNA platforms. Tetris reconstructs execution flows and synchronization relationships among cores to analyze each tensor's liveness interval. The memory management problem is then converted into a sequence permutation problem. Tetris uses a genetic algorithm to explore the permutation space, optimize the memory management strategy, and reduce memory footprints. We evaluate several typical neural networks, and the experimental results demonstrate that Tetris outperforms state-of-the-art memory allocation methods, achieving average memory reduction ratios of 91.9% and 87.9% for a quad-core and a 16-core Cambricon-X platform, respectively.
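The liveness-interval analysis and the per-permutation placement step that a genetic algorithm would drive can be sketched as follows; the linear trace format and the first-fit placement rule are illustrative assumptions, not Tetris's exact algorithm:

```python
def liveness(program):
    """Compute each tensor's [first_def, last_use] interval from a linear
    execution trace of (outputs, inputs) per op."""
    intervals = {}
    for t, (outs, ins) in enumerate(program):
        for name in outs + ins:
            lo, hi = intervals.get(name, (t, t))
            intervals[name] = (min(lo, t), max(hi, t))
    return intervals

def assign_offsets(intervals, sizes, order):
    """Place tensors one by one (in the given permutation) at the lowest
    offset that avoids overlapping any tensor whose lifetime intersects.
    A GA would search over `order` to minimize the resulting footprint."""
    placed = {}  # name -> (offset, size)
    for name in order:
        lo, hi = intervals[name]
        offset = 0
        for other, (o_off, o_size) in sorted(placed.items(),
                                             key=lambda kv: kv[1][0]):
            o_lo, o_hi = intervals[other]
            lifetimes_overlap = lo <= o_hi and o_lo <= hi
            ranges_overlap = (offset < o_off + o_size
                              and o_off < offset + sizes[name])
            if lifetimes_overlap and ranges_overlap:
                offset = o_off + o_size  # bump past the conflicting tensor
        placed[name] = (offset, sizes[name])
    return placed

# Tiny trace: a feeds b and the final op; b feeds c.
prog = [(["a"], []), (["b"], ["a"]), (["c"], ["b"]), ([], ["a", "c"])]
sizes = {"a": 4, "b": 2, "c": 4}
iv = liveness(prog)
placed = assign_offsets(iv, sizes, ["a", "b", "c"])
```

Tensors with disjoint lifetimes may share addresses, so the footprint depends on the placement order; that dependence is exactly what turns memory management into a sequence permutation problem.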
With the support of the National Natural Science Foundation of China and the 'Strategic Priority Research Program' of the Chinese Academy of Sciences, a collaborative study by the research groups led by Professors Tian Zhixi (田志喜), Wang Guodong (王国栋), and Zhu Baoge (朱保葛) from the
Existing dish recognition algorithms mainly focus on accuracy with predefined classes, which limits their application scope. In this paper, we propose a practical two-stage dish recognition framework (DRNet) that yields a tradeoff between speed and accuracy while adapting to variation in the number of classes. In the first stage, we build an arbitrary-oriented dish detector (AODD) to localize dish positions, which effectively alleviates the impact of background noise and pose variations. In the second stage, we propose a dish re-identifier (DReID) to recognize registered dishes and handle uncertain categories. To further improve the accuracy of DRNet, we design an attribute recognition (AR) module to predict the attributes of dishes; the attributes serve as auxiliary information to enhance the discriminative ability of DRNet. Moreover, pruning and quantization are applied to our model for deployment in embedded environments. Finally, to facilitate the study of dish recognition, a well-annotated dataset is established. AODD, DReID, AR, and DRNet run at about 14, 25, 16, and 5 fps, respectively, on RKNN 3399 Pro hardware.
To tackle the challenge of applying convolutional neural networks (CNNs) on field-programmable gate arrays (FPGAs), given their computational complexity, a high-performance CNN hardware accelerator was designed in the Verilog hardware description language. It uses a pipeline architecture with three parallel dimensions: input channels, output channels, and convolution kernels. Firstly, two multiply-and-accumulate (MAC) operations are packed into one digital signal processing (DSP) block of the FPGA to double the computation rate of the CNN accelerator. Secondly, strategies of feature-map block partitioning and special memory arrangement are proposed to optimize the total amount of off-chip memory access and reduce the pressure on FPGA bandwidth. Finally, an efficient computational array combining a multiply-add tree with the Winograd fast convolution algorithm is designed to balance hardware resource consumption and computational performance. The highly parallel CNN accelerator was deployed on an Alinx ZU3EG board, using the YOLOv3-tiny algorithm as the test object. The average computing performance of the CNN accelerator is 127.5 giga operations per second (GOPS). The experimental results show that the hardware architecture effectively improves the computational power of CNNs and provides better performance than other existing schemes in terms of power consumption and the efficiency of DSPs and block random access memories (BRAMs).
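The double-MAC packing trick, in which two products share one multiplier by placing one operand in the upper bits of the multiplicand, can be demonstrated arithmetically. Unsigned 8-bit operands and an 18-bit field are assumptions for illustration; a real DSP-block design also needs sign-correction logic:

```python
SHIFT = 18  # field width chosen so an 8-bit x 8-bit product cannot spill over

def packed_mac(a, w1, w2):
    """Compute a*w1 and a*w2 with ONE multiplication by packing w1 into
    the upper bits of the multiplicand:
        (w1 << SHIFT | w2) * a == (w1*a << SHIFT) + w2*a
    which holds as long as w2*a < 2**SHIFT (true for unsigned 8-bit)."""
    packed = (w1 << SHIFT) | w2
    product = packed * a                 # the single hardware multiply
    p2 = product & ((1 << SHIFT) - 1)    # low field:  a * w2
    p1 = product >> SHIFT                # high field: a * w1
    return p1, p2
```

Since the low product (at most 255 × 255 = 65 025) never reaches 2^18, the two fields stay independent, and one wide multiplier yields both results per cycle.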
Funding: the National Key R&D Program of China (No. 2018AAA0103300); the National Natural Science Foundation of China (Nos. 61925208, U20A20227, U22A2028); the Chinese Academy of Sciences Project for Young Scientists in Basic Research (No. YSBR-029); and the Youth Innovation Promotion Association of the Chinese Academy of Sciences.
Funding: supported by the National Key R&D Program of China (No. 2022ZD0119001); the National Natural Science Foundation of China (Nos. 61834005, 61802304); the Education Department of Shaanxi Province (No. 22JY060); and the Shaanxi Provincial Key Research and Development Plan (No. 2024GX-YBXM-100).
Funding: the National Key Research and Development Program of China (Nos. 2017YFA0700902, 2017YFB1003101); the National Natural Science Foundation of China (Nos. 61472396, 61432016, 61473275, 61522211, 61532016, 61521092, 61502446, 61672491, 61602441, 61602446, 61732002, 61702478); the 973 Program of China (No. 2015CB358800); the National Science and Technology Major Project (No. 2018ZX01031102); the Transformation and Transfer of Scientific and Technological Achievements of the Chinese Academy of Sciences (No. KFJ-HGZX-013); and the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDBS01050200).
Funding: partially supported by the National Key Research and Development Program of China (Grants 2017YFB1003101, 2018AAA0103300, 2017YFA0700900, 2017YFA0700902, 2017YFA0700901); the National Natural Science Foundation of China (Grants 61732007, 61432016, 61532016, 61672491, 61602441, 61602446, 61732002, 61702478, and 61732020); the Beijing Natural Science Foundation (JQ18013); the National Science and Technology Major Project (2018ZX01031102); the Transformation and Transfer of Scientific and Technological Achievements of the Chinese Academy of Sciences (KFJ-HGZX-013); Key Research Projects in Frontier Science of the Chinese Academy of Sciences (QYZDBSSW-JSC001); the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB32050200, XDC01020000); the Standardization Research Project of the Chinese Academy of Sciences (BZ201800001); the Beijing Academy of Artificial Intelligence (BAAI); and the Beijing Nova Program of Science and Technology (Z191100001119093).
Funding: supported by the National Key Research and Development Program of China (Nos. 2017YFB1003101, 2018AAA0103300, 2017YFA0700900); the National Natural Science Foundation of China (Nos. 61702478, 61732007, 61906179); the Beijing Natural Science Foundation (No. JQ18013); the National Science and Technology Major Project (No. 2018ZX01031102); and the Beijing Academy of Artificial Intelligence.
Funding: supported by the U.S. National Science Foundation under Grant Nos. CCF-2131946, CCF-1953980, and CCF-1702980.
基金supported by the Beijing Natural Science Foundation under Grant No.JQ18013the National Natural Science Foundation of China under Grant Nos.61925208,61732007,61732002 and 61906179+1 种基金the Strategic Priority Research Program of Chinese Academy of Sciences(CAS)under Grant No.XDB32050200the Youth Innovation Promotion Association CAS,Beijing Academy of Artificial Intelligence(BAAI)and Xplore Prize.
Abstract: Dynamic neural network (NN) techniques are increasingly important because they facilitate deep learning with more complex network architectures. However, existing studies, which predominantly optimize static computational graphs with static scheduling methods, usually focus on optimizing static neural networks in deep neural network (DNN) accelerators. We analyze the execution process of dynamic neural networks and observe that their dynamic features introduce challenges for efficient scheduling and pipelining in existing DNN accelerators. We propose DyPipe, a holistic approach to optimizing dynamic neural network inference in enhanced DNN accelerators. DyPipe achieves significant performance improvements for dynamic neural networks while introducing negligible overhead for static neural networks. Our evaluation demonstrates that DyPipe achieves a 1.7× speedup on dynamic neural networks and maintains more than 96% of the performance for static neural networks.
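A toy timing model illustrates why dynamic features complicate pipelining: with fixed stage times a two-stage pipeline overlaps perfectly, whereas data-dependent stage times force a static schedule to reserve worst-case slots. All numbers are illustrative; this is not DyPipe's scheduling model:

```python
# Two-stage pipeline: stage 2 of tile i starts after stage 1 of tile i
# finishes and after stage 2 of tile i-1 finishes.
def pipeline_latency(stage1, stage2):
    t1_done = t2_done = 0
    for a, b in zip(stage1, stage2):
        t1_done += a                          # stage 1 runs back to back
        t2_done = max(t2_done, t1_done) + b
    return t2_done

static = pipeline_latency([4] * 6, [4] * 6)              # balanced static network
actual = pipeline_latency([4] * 6, [4, 7, 4, 4, 7, 4])   # data-dependent times
worst = pipeline_latency([4] * 6, [7] * 6)               # worst-case reservation

print(static, actual, worst)
```

The gap between `actual` and `worst` is the headroom a dynamic-aware scheduler can reclaim over a static schedule that must provision for the worst case.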
Abstract: Deep neural networks have evolved remarkably over the past few years and are currently the fundamental tools of many intelligent systems. At the same time, the computational complexity and resource consumption of these networks continue to increase. This poses a significant challenge to the deployment of such networks, especially in real-time applications or on resource-limited devices. Thus, network acceleration has become a hot topic within the deep learning community. As for the hardware implementation of deep neural networks, a batch of accelerators based on field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) have been proposed in recent years. In this paper, we provide a comprehensive survey of recent advances in network acceleration, compression, and accelerator design from both the algorithm and hardware points of view. Specifically, we provide a thorough analysis of each of the following topics: network pruning, low-rank approximation, network quantization, teacher-student networks, compact network design, and hardware accelerators. Finally, we introduce and discuss a few possible future directions.
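Of the compression techniques surveyed above, network quantization is the easiest to show in a few lines. A minimal sketch of symmetric per-tensor int8 quantization (a generic scheme, not any particular accelerator's):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, (64, 64)).astype(np.float32)   # toy weight tensor

def quantize_int8(t):
    """Symmetric per-tensor int8 quantization: t ~= scale * q."""
    scale = float(np.abs(t).max()) / 127.0
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale          # dequantize
err = float(np.abs(w - w_hat).max())
print(f"max reconstruction error {err:.6f} vs half-step {scale / 2:.6f}")
```

Storing `q` instead of `w` cuts weight memory by 4x, and the worst-case per-element error is half a quantization step; accuracy-sensitive deployments typically follow this with fine-tuning or calibration.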
Funding: Supported by the National Science and Technology Major Project of the Ministry of Science and Technology, China (No. 2018AAA0103100) and the National Natural Science Foundation of China (No. 61674090); partly supported by the Beijing National Research Center for Information Science and Technology (No. 042003266) and the Beijing Engineering Research Center (No. BG0149).
Abstract: Convolutional Neural Networks (CNNs) are widely used in computer vision, natural language processing, and other fields, and generally require low power and high efficiency in real applications. Thus, energy efficiency has become a critical indicator of CNN accelerators. Considering that asynchronous circuits have the advantages of low power consumption, high speed, and freedom from clock distribution problems, we design and implement an energy-efficient asynchronous CNN accelerator in a 65 nm Complementary Metal Oxide Semiconductor (CMOS) process. Given the absence of a commercial design tool flow for asynchronous circuits, we develop a novel design flow that efficiently implements Click-based asynchronous bundled-data circuits down to mask layout with conventional Electronic Design Automation (EDA) tools. We also introduce an adaptive delay matching method and perform accurate static timing analysis of the circuits to ensure correct timing. An accelerator for a handwriting recognition network (the LeNet-5 model) is implemented. Silicon test results show that the asynchronous accelerator consumes 30% less power in its computing array than the synchronous one, and that its energy efficiency reaches 1.538 TOPS/W, 12% higher than that of the synchronous chip.
Funding: Supported by the Beijing Natural Science Foundation under Grant No. JQ18013; the National Natural Science Foundation of China under Grant Nos. 61925208, 61732007, 61732002, and 61906179; the Strategic Priority Research Program of the Chinese Academy of Sciences (CAS) under Grant No. XDB32050200; the Youth Innovation Promotion Association CAS; the Beijing Academy of Artificial Intelligence (BAAI); and the Xplore Prize.
Abstract: Uniform memory multicore neural network accelerators (UNNAs) furnish huge computing power to emerging neural network applications. Meanwhile, with neural network architectures going deeper and wider, the limited memory capacity has become a constraint on deploying models on UNNA platforms. Therefore, efficiently managing memory space and reducing workload footprints are of urgent significance. In this paper, we propose Tetris: a heuristic static memory management framework for UNNA platforms. Tetris reconstructs execution flows and synchronization relationships among cores to analyze each tensor's liveness interval. The memory management problem is then converted into a sequence permutation problem. Tetris uses a genetic algorithm to explore the permutation space, optimize the memory management strategy, and reduce memory footprints. We evaluate several typical neural networks, and the experimental results demonstrate that Tetris outperforms state-of-the-art memory allocation methods, achieving average memory reduction ratios of 91.9% and 87.9% for a quad-core and a 16-core Cambricon-X platform, respectively.
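The sequence-permutation formulation can be sketched as follows: a permutation of tensors determines a greedy placement (each tensor at the lowest offset that does not conflict with an already-placed tensor whose lifetime overlaps its own), and a genetic algorithm searches the permutation space for the smallest peak footprint. The liveness intervals and GA hyper-parameters below are illustrative assumptions, not Tetris's actual data or tuning:

```python
import random

# Toy liveness intervals (start, end, size) for six tensors on one core.
tensors = [(0, 3, 4), (1, 5, 2), (2, 4, 3), (4, 8, 4), (5, 9, 2), (6, 9, 3)]

def footprint(order):
    """Peak memory of the greedy placement induced by a tensor ordering."""
    placed, peak = [], 0
    for i in order:
        s, e, sz = tensors[i]
        off = 0
        while True:
            clash = [o + z for s2, e2, o, z in placed
                     if s <= e2 and s2 <= e and off < o + z and o < off + sz]
            if not clash:
                break
            off = min(clash)               # bump past the lowest conflict
        placed.append((s, e, off, sz))
        peak = max(peak, off + sz)
    return peak

def evolve(pop=20, gens=40, seed=0):
    """Genetic search over tensor orderings to minimize the peak footprint."""
    rng = random.Random(seed)
    n = len(tensors)
    population = [rng.sample(range(n), n) for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=footprint)
        survivors = population[: pop // 2]     # elitist selection
        children = []
        while len(survivors) + len(children) < pop:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)          # one-point order crossover
            child = a[:cut] + [g for g in b if g not in a[:cut]]
            if rng.random() < 0.3:             # swap mutation
                i, j = rng.sample(range(n), 2)
                child[i], child[j] = child[j], child[i]
            children.append(child)
        population = survivors + children
    return min(population, key=footprint)

best = evolve()
print(best, footprint(best))
```

For these intervals the peak concurrent liveness is 9 words, and a good ordering (e.g. placing the two large tensors first so they share one address range) reaches that bound, while a poor ordering wastes space above it.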
Abstract: With the support of the National Natural Science Foundation of China and the Strategic Priority Research Program of the Chinese Academy of Sciences, a collaborative study by the research groups led by Professors Tian Zhixi (田志喜), Wang Guodong (王国栋), and Zhu Baoge (朱保葛) from the …
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 61972167 and 61802135); the Project of Guangxi Science and Technology (Grant No. GuiKeAD21075030); the Guangxi "Bagui Scholar" Teams for Innovation and Research Project; the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing; the Guangxi Talent Highland Project of Big Data Intelligence and Application; and the Open Project Program of the National Laboratory of Pattern Recognition (NLPR) (Grant No. 202000012).
Abstract: Existing dish recognition algorithms mainly focus on accuracy with predefined classes, which limits their application scope. In this paper, we propose a practical two-stage dish recognition framework (DRNet) that strikes a tradeoff between speed and accuracy while adapting to variation in the number of classes. In the first stage, we build an arbitrary-oriented dish detector (AODD) to localize dish positions, which effectively alleviates the impact of background noise and pose variations. In the second stage, we propose a dish re-identifier (DReID) to recognize registered dishes and handle uncertain categories. To further improve the accuracy of DRNet, we design an attribute recognition (AR) module to predict the attributes of dishes; the attributes serve as auxiliary information to enhance the discriminative ability of DRNet. Moreover, pruning and quantization are applied to the model so that it can be deployed in embedded environments. Finally, to facilitate the study of dish recognition, a well-annotated dataset is established. AODD, DReID, AR, and the full DRNet run at about 14, 25, 16, and 5 fps, respectively, on RKNN 3399 Pro hardware.
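The registration-based second stage can be sketched generically: dishes are registered as feature embeddings and recognized by nearest-neighbor cosine similarity, so adding a class requires no retraining. The `embed` function and the threshold below are stand-ins for illustration, not DReID's actual network or tuning:

```python
import numpy as np

gallery = {}          # registered dishes: name -> normalized embedding

def embed(feature):
    """Stand-in for the re-identification network: L2-normalized vector."""
    v = np.asarray(feature, dtype=np.float64)
    return v / np.linalg.norm(v)

def register(name, feature):
    gallery[name] = embed(feature)

def recognize(feature, threshold=0.8):
    """Nearest registered dish by cosine similarity, or 'unknown'."""
    q = embed(feature)
    name, score = max(((n, float(q @ e)) for n, e in gallery.items()),
                      key=lambda t: t[1])
    return name if score >= threshold else "unknown"

register("noodles", [1.0, 0.1, 0.0])
register("dumplings", [0.0, 1.0, 0.2])
print(recognize([0.9, 0.15, 0.0]))    # near the registered "noodles"
print(recognize([0.5, 0.5, 0.5]))     # matches nothing well
```

The similarity threshold is what lets the re-identifier reject "uncertain categories" instead of forcing every crop into a known class.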
基金supported by the National Natural Science Foundation of China(61871132,62171135)。
Abstract: To tackle the challenge of applying convolutional neural networks (CNNs) to field-programmable gate arrays (FPGAs) given their computational complexity, a high-performance CNN hardware accelerator based on the Verilog hardware description language was designed. It utilizes a pipeline architecture with three parallel dimensions: input channels, output channels, and convolution kernels. First, two multiply-and-accumulate (MAC) operations were packed into one digital signal processing (DSP) block of the FPGA to double the computation rate of the CNN accelerator. Second, strategies of feature map block partitioning and special memory arrangement were proposed to optimize the total amount of off-chip memory access and reduce the pressure on FPGA bandwidth. Finally, an efficient computational array combining a multiply-add tree with the Winograd fast convolution algorithm was designed to balance hardware resource consumption and computational performance. The highly parallel CNN accelerator was deployed on an Alinx ZU3EG board, using the YOLOv3-tiny algorithm as the test object. The average computing performance of the CNN accelerator is 127.5 giga operations per second (GOPS). The experimental results show that the hardware architecture effectively improves the computational power of CNNs and provides better performance than other existing schemes in terms of power consumption and the efficiency of DSPs and block random access memories (BRAMs).
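The Winograd fast convolution used in the computational array trades multiplications for cheap additions. A minimal sketch of the 1-D F(2,3) case with the standard transform matrices (the paper's 2-D hardware version is a generalization of this), computing two outputs of a 3-tap filter with 4 multiplications instead of 6:

```python
import numpy as np

# Standard F(2,3) Winograd transform matrices.
BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=np.float64)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs with 4 multiplications."""
    return AT @ ((G @ g) * (BT @ d))   # the elementwise product is the 4 muls

d = np.array([1.0, 2.0, 3.0, 4.0])            # 4 input samples
g = np.array([0.5, 1.0, -1.0])                # 3 filter taps
direct = np.array([g @ d[0:3], g @ d[1:4]])   # direct sliding-window result
print(winograd_f23(d, g), direct)
```

In hardware, `G @ g` is precomputed once per filter, so only the data transform, the elementwise multiplies, and the output transform sit on the critical path; this is why the array can balance DSP usage against adder-tree logic.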