期刊文献+
共找到11篇文章
< 1 >
每页显示 20 50 100
THUBrachy:fast Monte Carlo dose calculation tool accelerated by heterogeneous hardware for high-dose-rate brachytherapy 被引量:1
1
作者 An-Kang Hu Rui Qiu +5 位作者 Huan Liu Zhen Wu Chun-Yan Li Hui Zhang Jun-Li Li Rui-Jie Yang 《Nuclear Science and Techniques》 SCIE EI CAS CSCD 2021年第3期107-119,共13页
The Monte Carlo(MC)simulation is regarded as the gold standard for dose calculation in brachytherapy,but it consumes a large amount of computing resources.The development of heterogeneous computing makes it possible t... The Monte Carlo(MC)simulation is regarded as the gold standard for dose calculation in brachytherapy,but it consumes a large amount of computing resources.The development of heterogeneous computing makes it possible to substantially accelerate calculations with hardware accelerators.Accordingly,this study develops a fast MC tool,called THUBrachy,which can be accelerated by several types of hardware accelerators.THUBrachy can simulate photons with energy less than 3 MeV and considers all photon interactions in the energy range.It was benchmarked against the American Association of Physicists in Medicine Task Group No.43 Report using a water phantom and validated with Geant4 using a clinical case.A performance test was conducted using the clinical case,showing that a multicore central processing unit,Intel Xeon Phi,and graphics processing unit(GPU)can efficiently accelerate the simulation.GPU-accelerated THUBrachy is the fastest version,which is 200 times faster than the serial version and approximately 500 times faster than Geant4.The proposed tool shows great potential for fast and accurate dose calculations in clinical applications. 展开更多
关键词 High-dose-rate brachytherapy Monte Carlo Heterogeneous computing hardware accelerators
下载PDF
A Novel Quantization and Model Compression Approach for Hardware Accelerators in Edge Computing
2
作者 Fangzhou He Ke Ding +3 位作者 DingjiangYan Jie Li Jiajun Wang Mingzhe Chen 《Computers, Materials & Continua》 SCIE EI 2024年第8期3021-3045,共25页
Massive computational complexity and memory requirement of artificial intelligence models impede their deploy-ability on edge computing devices of the Internet of Things(IoT).While Power-of-Two(PoT)quantization is pro... Massive computational complexity and memory requirement of artificial intelligence models impede their deploy-ability on edge computing devices of the Internet of Things(IoT).While Power-of-Two(PoT)quantization is pro-posed to improve the efficiency for edge inference of Deep Neural Networks(DNNs),existing PoT schemes require a huge amount of bit-wise manipulation and have large memory overhead,and their efficiency is bounded by the bottleneck of computation latency and memory footprint.To tackle this challenge,we present an efficient inference approach on the basis of PoT quantization and model compression.An integer-only scalar PoT quantization(IOS-PoT)is designed jointly with a distribution loss regularizer,wherein the regularizer minimizes quantization errors and training disturbances.Additionally,two-stage model compression is developed to effectively reduce memory requirement,and alleviate bandwidth usage in communications of networked heterogenous learning systems.The product look-up table(P-LUT)inference scheme is leveraged to replace bit-shifting with only indexing and addition operations for achieving low-latency computation and implementing efficient edge accelerators.Finally,comprehensive experiments on Residual Networks(ResNets)and efficient architectures with Canadian Institute for Advanced Research(CIFAR),ImageNet,and Real-world Affective Faces Database(RAF-DB)datasets,indicate that our approach achieves 2×∼10×improvement in the reduction of both weight size and computation cost in comparison to state-of-the-art methods.A P-LUT accelerator prototype is implemented on the Xilinx KV260 Field Programmable Gate Array(FPGA)platform for accelerating convolution operations,with performance results showing that P-LUT reduces memory footprint by 1.45×,achieves more than 3×power efficiency and 2×resource efficiency,compared to the conventional bit-shifting scheme. 展开更多
关键词 Edge computing model compression hardware accelerator power-of-two quantization
下载PDF
Neural Networks on an FPGA and Hardware-Friendly Activation Functions
3
作者 Jiong Si Sarah L. Harris Evangelos Yfantis 《Journal of Computer and Communications》 2020年第12期251-277,共27页
This paper describes our implementation of several neural networks built on a field programmable gate array (FPGA) and used to recognize a handwritten digit dataset—the Modified National Institute of Standards and Te... This paper describes our implementation of several neural networks built on a field programmable gate array (FPGA) and used to recognize a handwritten digit dataset—the Modified National Institute of Standards and Technology (MNIST) database. We also propose a novel hardware-friendly activation function called the dynamic Rectifid Linear Unit (ReLU)—D-ReLU function that achieves higher performance than traditional activation functions at no cost to accuracy. We built a 2-layer online training multilayer perceptron (MLP) neural network on an FPGA with varying data width. Reducing the data width from 8 to 4 bits only reduces prediction accuracy by 11%, but the FPGA area decreases by 41%. Compared to networks that use the sigmoid functions, our proposed D-ReLU function uses 24% - 41% less area with no loss to prediction accuracy. Further reducing the data width of the 3-layer networks from 8 to 4 bits, the prediction accuracies only decrease by 3% - 5%, with area being reduced by 9% - 28%. Moreover, FPGA solutions have 29 times faster execution time, even despite running at a 60× lower clock rate. Thus, FPGA implementations of neural networks offer a high-performance, low power alternative to traditional software methods, and our novel D-ReLU activation function offers additional improvements to performance and power saving. 展开更多
关键词 Deep Learning D-ReLU Dynamic ReLU FPGA hardware Acceleration Activation Function
下载PDF
An FPGA-Based Resource-Saving Hardware Accelerator for Deep Neural Network
4
作者 Han Jia Xuecheng Zou 《International Journal of Intelligence Science》 2021年第2期57-69,共13页
With the development of computer vision researches, due to the state-of-the-art performance on image and video processing tasks, deep neural network (DNN) has been widely applied in various applications (autonomous ve... With the development of computer vision researches, due to the state-of-the-art performance on image and video processing tasks, deep neural network (DNN) has been widely applied in various applications (autonomous vehicles, weather forecasting, counter-terrorism, surveillance, traffic management, etc.). However, to achieve such performance, DNN models have become increasingly complicated and deeper, and result in heavy computational stress. Thus, it is not sufficient for the general central processing unit (CPU) processors to meet the real-time application requirements. To deal with this bottleneck, research based on hardware acceleration solution for DNN attracts great attention. Specifically, to meet various real-life applications, DNN acceleration solutions mainly focus on issue of hardware acceleration with intense memory and calculation resource. In this paper, a novel resource-saving architecture based on Field Programmable Gate Array (FPGA) is proposed. Due to the novel designed processing element (PE), the proposed architecture </span><span style="font-family:Verdana;">achieves good performance with the extremely limited calculating resource. The on-chip buffer allocation helps enhance resource-saving performance on memory. Moreover, the accelerator improves its performance by exploiting</span> <span style="font-family:Verdana;">the sparsity property of the input feature map. Compared to other state-of-the-art</span><span style="font-family:Verdana;"> solutions based on FPGA, our architecture achieves good performance, with quite limited resource consumption, thus fully meet the requirement of real-time applications. 展开更多
关键词 Deep Neural Network RESOURCE-SAVING hardware Accelerator Data Flow
下载PDF
Hardware Design of Moving Object Detection on Reconfigurable System
5
作者 Hung-Yu Chen Yuan-Kai Wang 《Journal of Computer and Communications》 2016年第10期30-43,共14页
Moving object detection including background subtraction and morphological processing is a critical research topic for video surveillance because of its high computational loading and power consumption. This paper pro... Moving object detection including background subtraction and morphological processing is a critical research topic for video surveillance because of its high computational loading and power consumption. This paper proposes a hardware design to accelerate the computation of background subtraction with low power consumption. A real-time background subtraction method is designed with a frame-buffer scheme and function partition to improve throughput, and implemented using Verilog HDL on FPGA. The design parallelizes the computations of background update and subtraction with a seven-stage pipeline. A stripe-based morphological processing and accounting for the completion of detected objects is devised. Simulation results for videos of VGA resolutions on a low-end FPGA device show 368 fps throughput for only the real-time background subtraction module, and 51 fps for the whole system, including off-chip memory access. Real-time efficiency with low power consumption and low resource utilization is thus demonstrated. 展开更多
关键词 Background Substraction Moving Object Detection Field Programmable Gate Array (FPGA) hardware Acceleration
下载PDF
FPGA Accelerators for Computing Interatomic Potential-Based Molecular Dynamics Simulation for Gold Nanoparticles:Exploring Different Communication Protocols
6
作者 Ankitkumar Patel Srivathsan Vasudevan Satya Bulusu 《Computers, Materials & Continua》 SCIE EI 2024年第9期3803-3818,共16页
Molecular Dynamics(MD)simulation for computing Interatomic Potential(IAP)is a very important High-Performance Computing(HPC)application.MD simulation on particles of experimental relevance takes huge computation time,... Molecular Dynamics(MD)simulation for computing Interatomic Potential(IAP)is a very important High-Performance Computing(HPC)application.MD simulation on particles of experimental relevance takes huge computation time,despite using an expensive high-end server.Heterogeneous computing,a combination of the Field Programmable Gate Array(FPGA)and a computer,is proposed as a solution to compute MD simulation efficiently.In such heterogeneous computation,communication between FPGA and Computer is necessary.One such MD simulation,explained in the paper,is the(Artificial Neural Network)ANN-based IAP computation of gold(Au_(147)&Au_(309))nanoparticles.MD simulation calculates the forces between atoms and the total energy of the chemical system.This work proposes the novel design and implementation of an ANN IAP-based MD simulation for Au_(147)&Au_(309) using communication protocols,such as Universal Asynchronous Receiver-Transmitter(UART)and Ethernet,for communication between the FPGA and the host computer.To improve the latency of MD simulation through heterogeneous computing,Universal Asynchronous Receiver-Transmitter(UART)and Ethernet communication protocols were explored to conduct MD simulation of 50,000 cycles.In this study,computation times of 17.54 and 18.70 h were achieved with UART and Ethernet,respectively,compared to the conventional server time of 29 h for Au_(147) nanoparticles.The results pave the way for the development of a Lab-on-a-chip application. 展开更多
关键词 Ethernet hardware accelerator heterogeneous computing interatomic potential(IAP) MDsimulation peripheral component interconnect express(PCIe) UART
下载PDF
Research on High-Precision Stochastic Computing VLSI Structures for Deep Neural Network Accelerators
7
作者 WU Jingguo ZHU Jingwei +3 位作者 XIONG Xiankui YAO Haidong WANG Chengchen CHEN Yun 《ZTE Communications》 2024年第4期9-17,共9页
Deep neural networks(DNN)are widely used in image recognition,image classification,and other fields.However,as the model size increases,the DNN hardware accelerators face the challenge of higher area overhead and ener... Deep neural networks(DNN)are widely used in image recognition,image classification,and other fields.However,as the model size increases,the DNN hardware accelerators face the challenge of higher area overhead and energy consumption.In recent years,stochastic computing(SC)has been considered a way to realize deep neural networks and reduce hardware consumption.A probabilistic compensation algorithm is proposed to solve the accuracy problem of stochastic calculation,and a fully parallel neural network accelerator based on a deterministic method is designed.The software simulation results show that the accuracy of the probability compensation algorithm on the CIFAR-10 data set is 95.32%,which is 14.98%higher than that of the traditional SC algorithm.The accuracy of the deterministic algorithm on the CIFAR-10 dataset is 95.06%,which is 14.72%higher than that of the traditional SC algorithm.The results of Very Large Scale Integration Circuit(VLSI)hardware tests show that the normalized energy efficiency of the fully parallel neural network accelerator based on the deterministic method is improved by 31%compared with the circuit based on binary computing. 展开更多
关键词 stochastic computing hardware accelerator deep neural network
下载PDF
Towards High-Performance Graph Processing: From a Hardware/Software Co-Design Perspective
8
作者 廖小飞 赵文举 +7 位作者 金海 姚鹏程 黄禹 王庆刚 赵进 郑龙 张宇 邵志远 《Journal of Computer Science & Technology》 SCIE EI CSCD 2024年第2期245-266,共22页
Graph processing has been widely used in many scenarios,from scientific computing to artificial intelligence.Graph processing exhibits irregular computational parallelism and random memory accesses,unlike traditional ... Graph processing has been widely used in many scenarios,from scientific computing to artificial intelligence.Graph processing exhibits irregular computational parallelism and random memory accesses,unlike traditional workloads.Therefore,running graph processing workloads on conventional architectures(e.g.,CPUs and GPUs)often shows a significantly low compute-memory ratio with few performance benefits,which can be,in many cases,even slower than a specialized single-thread graph algorithm.While domain-specific hardware designs are essential for graph processing,it is still challenging to transform the hardware capability to performance boost without coupled software codesigns.This article presents a graph processing ecosystem from hardware to software.We start by introducing a series of hardware accelerators as the foundation of this ecosystem.Subsequently,the codesigned parallel graph systems and their distributed techniques are presented to support graph applications.Finally,we introduce our efforts on novel graph applications and hardware architectures.Extensive results show that various graph applications can be efficiently accelerated in this graph processing ecosystem. 展开更多
关键词 graph processing hardware accelerator software system high performance ECOSYSTEM
原文传递
Real-time pre-processing system with hardware accelerator for mobile core networks 被引量:1
9
作者 Mian CHENG Jin-shu SU Jing XU 《Frontiers of Information Technology & Electronic Engineering》 SCIE EI CSCD 2017年第11期1720-1731,共12页
With the rapidly increasing number of mobile devices being used as essential terminals or platforms for communication, security threats now target the whole telecommunication infrastructure and become increasingly ser... With the rapidly increasing number of mobile devices being used as essential terminals or platforms for communication, security threats now target the whole telecommunication infrastructure and become increasingly serious. Network probing tools, which are deployed as a bypass device at a mobile core network gateway, can collect and analyze all the traffic for security detection. However, due to the ever-increasing link speed, it is of vital importance to offioad the processing pressure of the detection system. In this paper, we design and evaluate a real-time pre-processing system, which includes a hardware accelerator and a multi-core processor. The implemented prototype can quickly restore each encapsulated packet and effectively distribute traffic to multiple back-end detection systems. We demonstrate the prototype in a well-deployed network environment with large volumes of real data. Experimental results show that our system can achieve at least 18 Gb/s with no packet loss with all kinds of communication protocols. 展开更多
关键词 Mobile network Real-time processing hardware acceleration
原文传递
Hardware Acceleration for SLAM in Mobile Systems
10
作者 樊哲 郝一帆 +2 位作者 支天 郭崎 杜子东 《Journal of Computer Science & Technology》 SCIE EI CSCD 2023年第6期1300-1322,共23页
The emerging mobile robot industry has spurred a flurry of interest in solving the simultaneous localization and mapping(SLAM)problem.However,existing SLAM platforms have difficulty in meeting the real-time and low-po... The emerging mobile robot industry has spurred a flurry of interest in solving the simultaneous localization and mapping(SLAM)problem.However,existing SLAM platforms have difficulty in meeting the real-time and low-pow-er requirements imposed by mobile systems.Though specialized hardware is promising with regard to achieving high per-formance and lowering the power,designing an efficient accelerator for SLAM is severely hindered by a wide variety of SLAM algorithms.Based on our detailed analysis of representative SLAM algorithms,we observe that SLAM algorithms advance two challenges for designing efficient hardware accelerators:the large number of computational primitives and ir-regular control flows.To address these two challenges,we propose a hardware accelerator that features composable com-putation units classified as the matrix,vector,scalar,and control units.In addition,we design a hierarchical instruction set for coping with a broad range of SLAM algorithms with irregular control flows.Experimental results show that,com-pared against an Intel x86 processor,on average,our accelerator with the area of 7.41 mm^(2) achieves 10.52x and 112.62x better performance and energy savings,respectively,across different datasets.Compared against a more energy-efficient ARM Cortex processor,our accelerator still achieves 33.03x and 62.64x better performance and energy savings,respec-tively. 展开更多
关键词 hardware accelerator instruction set mobile system simultaneous localization and mapping(SLAM)algorithm
原文传递
Recent advances in efficient computation of deep convolutional neural networks 被引量:37
11
作者 Jian CHENG Pei-song WANG +2 位作者 Gang LI Qing-hao HU Han-qing LU 《Frontiers of Information Technology & Electronic Engineering》 SCIE EI CSCD 2018年第1期64-77,共14页
Deep neural networks have evolved remarkably over the past few years and they are currently the fundamental tools of many intelligent systems.At the same time,the computational complexity and resource consumption of t... Deep neural networks have evolved remarkably over the past few years and they are currently the fundamental tools of many intelligent systems.At the same time,the computational complexity and resource consumption of these networks continue to increase.This poses a significant challenge to the deployment of such networks,especially in real-time applications or on resource-limited devices.Thus,network acceleration has become a hot topic within the deep learning community.As for hardware implementation of deep neural networks,a batch of accelerators based on a field-programmable gate array(FPGA) or an application-specific integrated circuit(ASIC)have been proposed in recent years.In this paper,we provide a comprehensive survey of recent advances in network acceleration,compression,and accelerator design from both algorithm and hardware points of view.Specifically,we provide a thorough analysis of each of the following topics:network pruning,low-rank approximation,network quantization,teacher–student networks,compact network design,and hardware accelerators.Finally,we introduce and discuss a few possible future directions. 展开更多
关键词 Deep neural networks Acceleration Compression hardware accelerator
原文传递
上一页 1 下一页 到第
使用帮助 返回顶部