期刊文献+
共找到496篇文章
< 1 2 25 >
每页显示 20 50 100
A Novel Quantization and Model Compression Approach for Hardware Accelerators in Edge Computing
1
作者 Fangzhou He Ke Ding +3 位作者 DingjiangYan Jie Li Jiajun Wang Mingzhe Chen 《Computers, Materials & Continua》 SCIE EI 2024年第8期3021-3045,共25页
Massive computational complexity and memory requirement of artificial intelligence models impede their deploy-ability on edge computing devices of the Internet of Things(IoT).While Power-of-Two(PoT)quantization is pro... Massive computational complexity and memory requirement of artificial intelligence models impede their deploy-ability on edge computing devices of the Internet of Things(IoT).While Power-of-Two(PoT)quantization is pro-posed to improve the efficiency for edge inference of Deep Neural Networks(DNNs),existing PoT schemes require a huge amount of bit-wise manipulation and have large memory overhead,and their efficiency is bounded by the bottleneck of computation latency and memory footprint.To tackle this challenge,we present an efficient inference approach on the basis of PoT quantization and model compression.An integer-only scalar PoT quantization(IOS-PoT)is designed jointly with a distribution loss regularizer,wherein the regularizer minimizes quantization errors and training disturbances.Additionally,two-stage model compression is developed to effectively reduce memory requirement,and alleviate bandwidth usage in communications of networked heterogenous learning systems.The product look-up table(P-LUT)inference scheme is leveraged to replace bit-shifting with only indexing and addition operations for achieving low-latency computation and implementing efficient edge accelerators.Finally,comprehensive experiments on Residual Networks(ResNets)and efficient architectures with Canadian Institute for Advanced Research(CIFAR),ImageNet,and Real-world Affective Faces Database(RAF-DB)datasets,indicate that our approach achieves 2×∼10×improvement in the reduction of both weight size and computation cost in comparison to state-of-the-art methods.A P-LUT accelerator prototype is implemented on the Xilinx KV260 Field Programmable Gate Array(FPGA)platform for accelerating convolution operations,with performance results showing that P-LUT reduces memory footprint by 1.45×,achieves more than 3×power efficiency and 2×resource efficiency,compared to the conventional bit-shifting scheme. 展开更多
关键词 Edge computing model compression hardware accelerator power-of-two quantization
下载PDF
FPGA Accelerators for Computing Interatomic Potential-Based Molecular Dynamics Simulation for Gold Nanoparticles:Exploring Different Communication Protocols
2
作者 Ankitkumar Patel Srivathsan Vasudevan Satya Bulusu 《Computers, Materials & Continua》 SCIE EI 2024年第9期3803-3818,共16页
Molecular Dynamics(MD)simulation for computing Interatomic Potential(IAP)is a very important High-Performance Computing(HPC)application.MD simulation on particles of experimental relevance takes huge computation time,... Molecular Dynamics(MD)simulation for computing Interatomic Potential(IAP)is a very important High-Performance Computing(HPC)application.MD simulation on particles of experimental relevance takes huge computation time,despite using an expensive high-end server.Heterogeneous computing,a combination of the Field Programmable Gate Array(FPGA)and a computer,is proposed as a solution to compute MD simulation efficiently.In such heterogeneous computation,communication between FPGA and Computer is necessary.One such MD simulation,explained in the paper,is the(Artificial Neural Network)ANN-based IAP computation of gold(Au_(147)&Au_(309))nanoparticles.MD simulation calculates the forces between atoms and the total energy of the chemical system.This work proposes the novel design and implementation of an ANN IAP-based MD simulation for Au_(147)&Au_(309) using communication protocols,such as Universal Asynchronous Receiver-Transmitter(UART)and Ethernet,for communication between the FPGA and the host computer.To improve the latency of MD simulation through heterogeneous computing,Universal Asynchronous Receiver-Transmitter(UART)and Ethernet communication protocols were explored to conduct MD simulation of 50,000 cycles.In this study,computation times of 17.54 and 18.70 h were achieved with UART and Ethernet,respectively,compared to the conventional server time of 29 h for Au_(147) nanoparticles.The results pave the way for the development of a Lab-on-a-chip application. 展开更多
关键词 Ethernet hardware accelerator heterogeneous computing interatomic potential(IAP) MDsimulation peripheral component interconnect express(PCIe) UART
下载PDF
An FPGA-Based Resource-Saving Hardware Accelerator for Deep Neural Network
3
作者 Han Jia Xuecheng Zou 《International Journal of Intelligence Science》 2021年第2期57-69,共13页
With the development of computer vision researches, due to the state-of-the-art performance on image and video processing tasks, deep neural network (DNN) has been widely applied in various applications (autonomous ve... With the development of computer vision researches, due to the state-of-the-art performance on image and video processing tasks, deep neural network (DNN) has been widely applied in various applications (autonomous vehicles, weather forecasting, counter-terrorism, surveillance, traffic management, etc.). However, to achieve such performance, DNN models have become increasingly complicated and deeper, and result in heavy computational stress. Thus, it is not sufficient for the general central processing unit (CPU) processors to meet the real-time application requirements. To deal with this bottleneck, research based on hardware acceleration solution for DNN attracts great attention. Specifically, to meet various real-life applications, DNN acceleration solutions mainly focus on issue of hardware acceleration with intense memory and calculation resource. In this paper, a novel resource-saving architecture based on Field Programmable Gate Array (FPGA) is proposed. Due to the novel designed processing element (PE), the proposed architecture </span><span style="font-family:Verdana;">achieves good performance with the extremely limited calculating resource. The on-chip buffer allocation helps enhance resource-saving performance on memory. Moreover, the accelerator improves its performance by exploiting</span> <span style="font-family:Verdana;">the sparsity property of the input feature map. Compared to other state-of-the-art</span><span style="font-family:Verdana;"> solutions based on FPGA, our architecture achieves good performance, with quite limited resource consumption, thus fully meet the requirement of real-time applications. 展开更多
关键词 Deep Neural Network RESOURCE-SAVING hardware accelerator Data Flow
下载PDF
THUBrachy:fast Monte Carlo dose calculation tool accelerated by heterogeneous hardware for high-dose-rate brachytherapy 被引量:1
4
作者 An-Kang Hu Rui Qiu +5 位作者 Huan Liu Zhen Wu Chun-Yan Li Hui Zhang Jun-Li Li Rui-Jie Yang 《Nuclear Science and Techniques》 SCIE EI CAS CSCD 2021年第3期107-119,共13页
The Monte Carlo(MC)simulation is regarded as the gold standard for dose calculation in brachytherapy,but it consumes a large amount of computing resources.The development of heterogeneous computing makes it possible t... The Monte Carlo(MC)simulation is regarded as the gold standard for dose calculation in brachytherapy,but it consumes a large amount of computing resources.The development of heterogeneous computing makes it possible to substantially accelerate calculations with hardware accelerators.Accordingly,this study develops a fast MC tool,called THUBrachy,which can be accelerated by several types of hardware accelerators.THUBrachy can simulate photons with energy less than 3 MeV and considers all photon interactions in the energy range.It was benchmarked against the American Association of Physicists in Medicine Task Group No.43 Report using a water phantom and validated with Geant4 using a clinical case.A performance test was conducted using the clinical case,showing that a multicore central processing unit,Intel Xeon Phi,and graphics processing unit(GPU)can efficiently accelerate the simulation.GPU-accelerated THUBrachy is the fastest version,which is 200 times faster than the serial version and approximately 500 times faster than Geant4.The proposed tool shows great potential for fast and accurate dose calculations in clinical applications. 展开更多
关键词 High-dose-rate brachytherapy Monte Carlo Heterogeneous computing hardware accelerators
下载PDF
System-on-a-Chip (SoC) Based Hardware Acceleration for Video Codec
5
作者 Xinwei Niu Jeffrey Fan 《Optics and Photonics Journal》 2013年第2期112-117,共6页
Nowadays, from home monitoring to large airport security, a lot of digital video surveillance systems have been used. Digital surveillance system usually requires streaming video processing abilities. As an advanced v... Nowadays, from home monitoring to large airport security, a lot of digital video surveillance systems have been used. Digital surveillance system usually requires streaming video processing abilities. As an advanced video coding method, H.264 is introduced to reduce the large video data dramatically (usually by 70X or more). However, computational overhead occurs when coding and decoding H.264 video. In this paper, a System-on-a-Chip (SoC) based hardware acceleration solution for video codec is proposed, which can also be used for other software applications. The characteristics of the video codec are analyzed by using the profiling tool. The Hadamard function, which is the bottleneck of H.264, is identified not only by execution time but also another two attributes, such as cycle per loop and loop round. The Co-processor approach is applied to accelerate the Hadamard function by transforming it to hardware. Performance improvement, resource costs and energy consumption are compared and analyzed. Experimental results indicate that 76.5% energy deduction and 8.09X speedup can be reached after balancing these three key factors. 展开更多
关键词 SOC Software PROFILING hardware accelerATION Video CODEC
下载PDF
Automatic Control System of Ion Electrostatic Accelerator and Anti-Interference Measures 被引量:1
6
作者 孙振武 霍裕平 +2 位作者 刘根成 李玉晓 李涛 《Plasma Science and Technology》 SCIE EI CAS CSCD 2007年第1期101-105,共5页
An automatic control system for the electrostatic accelerator has been developed by adopting the PLC (Programmable Logic Controller) control technique, infrared and optical-fibre transmission technique and network c... An automatic control system for the electrostatic accelerator has been developed by adopting the PLC (Programmable Logic Controller) control technique, infrared and optical-fibre transmission technique and network communication with the purpose to improve the intelligence level of the accelerator and to enhance the ability of monitoring, collecting and recording parameters. In view of the control system' structure, some anti-interference measures have been adopted after analyzing the interference sources. The measures in hardware include controlling the position of the corona needle, using surge arresters, shielding, ground connection and stabilizing the voltage. The measures in terms of software involve inter-blocking protection, soft-spacing, time delay, and diagnostic and protective programs. The electromagnetic compatible ability of the control system has thus been effectively improved. 展开更多
关键词 electrostatic accelerator computer control SOFTWARE hardware electromag- netic interference
下载PDF
A Dynamically Reconfigurable Accelerator Design Using a Sparse-Winograd Decomposition Algorithm for CNNs
7
作者 Yunping Zhao Jianzhuang Lu Xiaowen Chen 《Computers, Materials & Continua》 SCIE EI 2021年第1期517-535,共19页
Convolutional Neural Networks(CNNs)are widely used in many fields.Due to their high throughput and high level of computing characteristics,however,an increasing number of researchers are focusing on how to improve the... Convolutional Neural Networks(CNNs)are widely used in many fields.Due to their high throughput and high level of computing characteristics,however,an increasing number of researchers are focusing on how to improve the computational efficiency,hardware utilization,or flexibility of CNN hardware accelerators.Accordingly,this paper proposes a dynamically reconfigurable accelerator architecture that implements a Sparse-Winograd F(2×2.3×3)-based high-parallelism hardware architecture.This approach not only eliminates the pre-calculation complexity associated with the Winograd algorithm,thereby reducing the difficulty of hardware implementation,but also greatly improves the flexibility of the hardware;as a result,the accelerator can realize the calculation of Conventional Convolution,Grouped Convolution(GCONV)or Depthwise Separable Convolution(DSC)using the same hardware architecture.Our experimental results show that the accelerator achieves a 3x–4.14x speedup compared with the designs that do not use the acceleration algorithm on VGG-16 and MobileNet V1.Moreover,compared with previous designs using the traditional Winograd algorithm,the accelerator design achieves 1.4x–1.8x speedup.At the same time,the efficiency of the multiplier improves by up to 142%. 展开更多
关键词 High performance computing accelerator architecture hardware
下载PDF
Neural Networks on an FPGA and Hardware-Friendly Activation Functions
8
作者 Jiong Si Sarah L. Harris Evangelos Yfantis 《Journal of Computer and Communications》 2020年第12期251-277,共27页
This paper describes our implementation of several neural networks built on a field programmable gate array (FPGA) and used to recognize a handwritten digit dataset—the Modified National Institute of Standards and Te... This paper describes our implementation of several neural networks built on a field programmable gate array (FPGA) and used to recognize a handwritten digit dataset—the Modified National Institute of Standards and Technology (MNIST) database. We also propose a novel hardware-friendly activation function called the dynamic Rectifid Linear Unit (ReLU)—D-ReLU function that achieves higher performance than traditional activation functions at no cost to accuracy. We built a 2-layer online training multilayer perceptron (MLP) neural network on an FPGA with varying data width. Reducing the data width from 8 to 4 bits only reduces prediction accuracy by 11%, but the FPGA area decreases by 41%. Compared to networks that use the sigmoid functions, our proposed D-ReLU function uses 24% - 41% less area with no loss to prediction accuracy. Further reducing the data width of the 3-layer networks from 8 to 4 bits, the prediction accuracies only decrease by 3% - 5%, with area being reduced by 9% - 28%. Moreover, FPGA solutions have 29 times faster execution time, even despite running at a 60× lower clock rate. Thus, FPGA implementations of neural networks offer a high-performance, low power alternative to traditional software methods, and our novel D-ReLU activation function offers additional improvements to performance and power saving. 展开更多
关键词 Deep Learning D-ReLU Dynamic ReLU FPGA hardware acceleration Activation Function
下载PDF
Hardware Design of Moving Object Detection on Reconfigurable System
9
作者 Hung-Yu Chen Yuan-Kai Wang 《Journal of Computer and Communications》 2016年第10期30-43,共14页
Moving object detection including background subtraction and morphological processing is a critical research topic for video surveillance because of its high computational loading and power consumption. This paper pro... Moving object detection including background subtraction and morphological processing is a critical research topic for video surveillance because of its high computational loading and power consumption. This paper proposes a hardware design to accelerate the computation of background subtraction with low power consumption. A real-time background subtraction method is designed with a frame-buffer scheme and function partition to improve throughput, and implemented using Verilog HDL on FPGA. The design parallelizes the computations of background update and subtraction with a seven-stage pipeline. A stripe-based morphological processing and accounting for the completion of detected objects is devised. Simulation results for videos of VGA resolutions on a low-end FPGA device show 368 fps throughput for only the real-time background subtraction module, and 51 fps for the whole system, including off-chip memory access. Real-time efficiency with low power consumption and low resource utilization is thus demonstrated. 展开更多
关键词 Background Substraction Moving Object Detection Field Programmable Gate Array (FPGA) hardware acceleration
下载PDF
面向边缘计算的可重构CNN协处理器研究与设计
10
作者 李伟 陈億 +2 位作者 陈韬 南龙梅 杜怡然 《电子与信息学报》 EI CAS CSCD 北大核心 2024年第4期1499-1512,共14页
随着深度学习技术的发展,卷积神经网络模型的参数量和计算量急剧增加,极大提高了卷积神经网络算法在边缘侧设备的部署成本。因此,为了降低卷积神经网络算法在边缘侧设备上的部署难度,减小推理时延和能耗开销,该文提出一种面向边缘计算... 随着深度学习技术的发展,卷积神经网络模型的参数量和计算量急剧增加,极大提高了卷积神经网络算法在边缘侧设备的部署成本。因此,为了降低卷积神经网络算法在边缘侧设备上的部署难度,减小推理时延和能耗开销,该文提出一种面向边缘计算的可重构CNN协处理器结构。基于按通道处理的数据流模式,提出的两级分布式存储方案解决了片上大规模的数据搬移和重构运算时PE单元间的大量数据移动导致的功耗开销和性能下降的问题;为了避免加速阵列中复杂的数据互联网络传播机制,降低控制的复杂度,该文提出一种灵活的本地访存机制和基于地址转换的填充机制,使得协处理器能够灵活实现任意规格的常规卷积、深度可分离卷积、池化和全连接运算,提升了硬件架构的灵活性。本文提出的协处理器包含256个PE运算单元和176 kB的片上私有存储器,在55 nm TT Corner(25°C,1.2 V)的CMOS工艺下进行逻辑综合和布局布线,最高时钟频率能够达到328 MHz,实现面积为4.41 mm^(2)。在320 MHz的工作频率下,该协处理器峰值运算性能为163.8 GOPs,面积效率为37.14GOPs/mm^(2),完成LeNet-5和MobileNet网络的能效分别为210.7 GOPs/W和340.08 GOPs/W,能够满足边缘智能计算场景下的能效和性能需求。 展开更多
关键词 硬件加速 卷积神经网络 可重构 ASIC
下载PDF
基于卷积神经网络的岩渣分类算法及其FPGA加速
11
作者 陈昌川 王新立 +5 位作者 朱嘉琪 张天骐 尹淑娟 王珩 魏琦 乔飞 《传感技术学报》 CAS CSCD 北大核心 2024年第1期80-88,共9页
全断面岩石掘进机在道路掘进过程中,刀盘挤压切削岩体容易产生刀盘磨损及损坏,从而造成经济损失,因此需要检测刀盘磨损的理论和技术来指导施工。岩渣是掘进过程的直接产物,携带丰富的信息,能够反映当前的施工状况,因此可以通过岩渣识别... 全断面岩石掘进机在道路掘进过程中,刀盘挤压切削岩体容易产生刀盘磨损及损坏,从而造成经济损失,因此需要检测刀盘磨损的理论和技术来指导施工。岩渣是掘进过程的直接产物,携带丰富的信息,能够反映当前的施工状况,因此可以通过岩渣识别利用这些信息间接实现对刀盘的监测。提出了一种基于卷积神经网络的岩渣识别算法,在岩渣数据集上实现了96.5%的分类准确率。随后为了便于FPGA硬件部署,提出一种网络压缩方法,将网络规模压缩到原始网络的2.28%,同时分类准确率相比原网络仅下降了0.9%。最后使用OpenCL技术在Intel Arria 10 GX1150平台上实现了算法部署,达到了224.54 GOP/s的吞吐率以及11.23 GOP/s/W的能效比。 展开更多
关键词 岩渣分类 FPGA 卷积神经网络 OPENCL 硬件加速
下载PDF
动态深度神经网络的硬件加速设计及FPGA实现
12
作者 王鹏 任轶群 +1 位作者 范毓洋 张嘉诚 《电讯技术》 北大核心 2024年第3期358-365,共8页
基于现场可编程门阵列(Field Programmable Gate Array,FPGA)实现的卷积神经网络由于具有优秀的目标识别能力,广泛应用在边缘设备。然而现有的神经网络部署多基于静态模型,因此存在无效特征提取、计算量增大、帧率降低等问题。为此,提... 基于现场可编程门阵列(Field Programmable Gate Array,FPGA)实现的卷积神经网络由于具有优秀的目标识别能力,广泛应用在边缘设备。然而现有的神经网络部署多基于静态模型,因此存在无效特征提取、计算量增大、帧率降低等问题。为此,提出了动态深度神经网络的实现方法。通过引入模型定点压缩技术和并行的卷积分块方法,并结合低延迟的数据调度策略,实现了高效卷积计算。同时对神经网络动态退出机制中引入的交叉熵损失函数,提出便于硬件实现的简化方法,设计专用的加速电路。根据所提方法,在Xilinx xc7z030平台部署了具有动态深度的ResNet110网络,平台最高可完成2.78×104 MOPS(Million Operations per Second)的乘积累加运算,并支持1.25 MOPS的自然指数运算和0.125 MOPS的对数运算,相较于i7-5960x处理器加速比达到287%,相较于NVIDIA TITAN X处理器加速比达到145%。 展开更多
关键词 边缘设备 动态深度神经网络 动态退出机制 硬件加速 加速电路
下载PDF
规则压缩模型和灵活架构的Transformer加速器设计
13
作者 姜小波 邓晗珂 +1 位作者 莫志杰 黎红源 《电子与信息学报》 EI CAS CSCD 北大核心 2024年第3期1079-1088,共10页
基于注意力机制的Transformer模型具有优越的性能,设计专用的Transformer加速器能大幅提高推理性能以及降低推理功耗。Transformer模型复杂性包括数量上和结构上的复杂性,其中结构上的复杂性导致不规则模型和规则硬件之间的失配,降低了... 基于注意力机制的Transformer模型具有优越的性能,设计专用的Transformer加速器能大幅提高推理性能以及降低推理功耗。Transformer模型复杂性包括数量上和结构上的复杂性,其中结构上的复杂性导致不规则模型和规则硬件之间的失配,降低了模型映射到硬件的效率。目前的加速器研究主要聚焦在解决模型数量上的复杂性,但对如何解决模型结构上的复杂性研究得不多。该文首先提出规则压缩模型,降低模型的结构复杂度,提高模型和硬件的匹配度,提高模型映射到硬件的效率。接着提出一种硬件友好的模型压缩方法,采用规则的偏移对角权重剪枝方案和简化硬件量化推理逻辑。此外,提出一个高效灵活的硬件架构,包括一种以块为单元的权重固定脉动运算阵列,同时包括一种准分布的存储架构。该架构可以高效实现算法到运算阵列的映射,同时实现高效的数据存储效率和降低数据移动。实验结果表明,该文工作在性能损失极小的情况下实现93.75%的压缩率,在FPGA上实现的加速器可以高效处理压缩后的Transformer模型,相比于中央处理器(CPU)和图形处理器(GPU)能效分别提高了12.45倍和4.17倍。 展开更多
关键词 自然语音处理 TRANSFORMER 模型压缩 硬件加速器 机器翻译
下载PDF
面向微控制器的卷积神经网络加速器设计
14
作者 乔建华 吴言 +1 位作者 栗亚宁 雷光政 《电子器件》 CAS 2024年第1期48-54,共7页
针对目前嵌入式微控制器的性能难以满足实时图像识别任务的问题,提出一种适用于微控制器的卷积神经网络加速器。该加速器在卷积层设计了无阻塞的行并行乘法-加法树结构,获得了更高的硬件利用率;为了满足行并行的数据吞吐量,设计了卷积专... 针对目前嵌入式微控制器的性能难以满足实时图像识别任务的问题,提出一种适用于微控制器的卷积神经网络加速器。该加速器在卷积层设计了无阻塞的行并行乘法-加法树结构,获得了更高的硬件利用率;为了满足行并行的数据吞吐量,设计了卷积专用SRAM存储器。加速器将池化和激活单元融入数据通路,有效减少数据重复存取带来的时间开销。FPGA原型验证表明加速器的性能达到92.2 GOPS@100 MHz;基于TSMC 130 nm工艺节点进行逻辑综合,加速器的动态功耗为33 mW,面积为90 764.2μm^(2),能效比高达2 793 GOPS/W,比FPGA加速器方案提高了约100倍。该加速器低功耗、低成本的特性,有利于实现嵌入式系统在目标检测、人脸识别等机器视觉领域的广泛应用。 展开更多
关键词 卷积神经网络 并行计算 流水线 硬件加速器 专用集成电路
下载PDF
时空图卷积网络的骨架识别硬件加速器设计
15
作者 谭会生 严舒琪 杨威 《电子测量技术》 北大核心 2024年第11期36-43,共8页
随着人工智能技术的不断发展,神经网络的数据规模逐渐扩大,神经网络的计算量也迅速攀升。为了减少时空图卷积神经网络的计算量,降低硬件实现的资源消耗,提升人体骨架识别时空图卷积神经网络(ST-GCN)实际应用系统的处理速度,利用现场可... 随着人工智能技术的不断发展,神经网络的数据规模逐渐扩大,神经网络的计算量也迅速攀升。为了减少时空图卷积神经网络的计算量,降低硬件实现的资源消耗,提升人体骨架识别时空图卷积神经网络(ST-GCN)实际应用系统的处理速度,利用现场可编程门阵列(FPGA),设计开发了一个基于时空图卷积神经网络的骨架识别硬件加速器。通过对原网络模型进行结构优化与数据量化,减少了FPGA实现约75%的计算量;利用邻接矩阵稀疏性的特点,提出了一种稀疏性矩阵乘加运算的优化方法,减少了约60%的乘法器资源消耗。经过对人体骨架识别实验验证,结果表明,在时钟频率100 MHz下,相较于CPU,FPGA加速ST-GCN单元,加速比达到30.53;FPGA加速人体骨架识别,加速比达到6.86。 展开更多
关键词 人体骨架识别 时空图卷积神经网络(ST-GCN) 硬件加速器 现场可编程门阵列(FPGA) 稀疏矩阵乘加运算硬件优化
下载PDF
数字电路芯片验证方法研究综述
16
作者 曹庆年 姚翌 孟开元 《工业控制计算机》 2024年第2期53-55,83,共4页
目前验证的方法和工具种类繁多,对动态验证、静态验证和硬件加速仿真三种典型的芯片验证方法进行综述,在验证设计质量、验证时间和验证成本三个方面说明各自的特点。最后提出了芯片验证的展望。对于芯片验证方法的研究,以期对后续研究... 目前验证的方法和工具种类繁多,对动态验证、静态验证和硬件加速仿真三种典型的芯片验证方法进行综述,在验证设计质量、验证时间和验证成本三个方面说明各自的特点。最后提出了芯片验证的展望。对于芯片验证方法的研究,以期对后续研究人员了解芯片验证方法,在今后的芯片验证设计过程中结合设计特点,选择正确的验证方法有所帮助。 展开更多
关键词 动态验证 静态验证 硬件加速仿真
下载PDF
应用于锂电池SOC估计的PCNN_LSTM硬件加速器设计
17
作者 王巍 夏旭 +2 位作者 丁辉 吴浩 郭家成 《微电子学与计算机》 2024年第10期106-116,共11页
为了克服传统的锂电池状态估计效果差、计算效率低和能效低等问题,提出一种应用于锂电池荷电状态(Stateof Charge,SOC)估计的PCNN_LSTM算法与硬件加速器设计。该算法结合了卷积神经网络和长短期记忆神经网络的特点,可以提取输入数据的... 为了克服传统的锂电池状态估计效果差、计算效率低和能效低等问题,提出一种应用于锂电池荷电状态(Stateof Charge,SOC)估计的PCNN_LSTM算法与硬件加速器设计。该算法结合了卷积神经网络和长短期记忆神经网络的特点,可以提取输入数据的空间特征和时间特征,从而实现更准确的估计效果。为了进一步提高计算效率,设计了基于现场可编程逻辑门阵列(FPGA)的硬件加速器。该加速器利用FPGA的并行计算和片上存储特性,通过并行流水和模块折叠复用的方式来优化卷积运算和矩阵乘法,采用分段线性拟合和移位的方式实现激活函数模块,以及采用分时复用策略实现element_wise模块。在保证精度的同时,有效减少了硬件资源的消耗,提高了整体性能。实验结果表明,在Zynq UltraScale+MPSoC ZCU102 FPGA上实现了一个输入时钟频率为100 MHz的PCNN-LSTM加速器,其峰值吞吐量为75.84GOP/s,能效比为60.915GOP/W。 展开更多
关键词 锂电池 荷电状态 卷积神经网络 长短期记忆神经网络 FPGA 硬件加速
下载PDF
全同态加密软硬件加速研究进展 被引量:1
18
作者 边松 毛苒 +8 位作者 朱永清 傅云濠 张舟 丁林 张吉良 张博 陈弈 董进 关振宇 《电子与信息学报》 EI CAS CSCD 北大核心 2024年第5期1790-1805,共16页
全同态加密(FHE)是一种重计算、轻交互的多方安全计算协议。在基于全同态加密的计算协议中,尽管计算参与方之间无需多轮交互与大量通信,加密状态下的密态数据处理时间通常是明文计算的10~3~10~6倍,极大地阻碍了这类计算协议的实际落地;... 全同态加密(FHE)是一种重计算、轻交互的多方安全计算协议。在基于全同态加密的计算协议中,尽管计算参与方之间无需多轮交互与大量通信,加密状态下的密态数据处理时间通常是明文计算的10~3~10~6倍,极大地阻碍了这类计算协议的实际落地;而密态数据上的主要处理负担是大规模的并行密码运算和运算所必须的密文及密钥数据搬运需求。该文聚焦软、硬件两个层面上的全同态加密加速这一研究热点,通过系统性地归类及整理当前领域中的文献,讨论全同态加密计算加速的研究现状与展望。 展开更多
关键词 全同态加密 同态算法 密码硬件加速
下载PDF
塑封集成电路HAST影响因素探讨
19
作者 杨永兴 张强 +3 位作者 侯旎璐 武玉杰 许春强 高海龙 《电子质量》 2024年第8期42-47,共6页
随着塑封集成电路在人工智能、移动互联网、物联网、云计算和大数据等领域的广泛应用,其面临长时间加电连续工作的需求,因此需要对其长期可靠性进行研究。结合试验实例,分别从设备、试验硬件和导线材质等方面对塑封集成电路高加速应力试... 随着塑封集成电路在人工智能、移动互联网、物联网、云计算和大数据等领域的广泛应用,其面临长时间加电连续工作的需求,因此需要对其长期可靠性进行研究。结合试验实例,分别从设备、试验硬件和导线材质等方面对塑封集成电路高加速应力试验(HAST)的影响因素进行了分析。分析结果表明采用高纯水、优化温湿度控制程序可避免凝露引发的集成电路试验过程失效;导电性阳极丝(CAF)和电化学迁移(ECM)是HAST中常见失效问题,试验硬件与工艺需做好CAF和ECM防控;试验线材会影响HAST,线材绝缘层与线芯材质要定期监测,适时更换。 展开更多
关键词 塑封集成电路 可靠性 高加速应力试验 影响因素 凝露 试验用水 试验硬件 导线
下载PDF
基于FPGA的图像处理硬件加速系统的设计 被引量:1
20
作者 张灿宇 封岸松 +2 位作者 张华良 易星 王俊彭 《计算机工程与设计》 北大核心 2024年第3期723-731,共9页
为解决图像处理算法越来越复杂,普通的计算平台已满足不了当前需求的问题,根据现场可编程门阵列(field programmable gate array, FPGA)的并行计算特点对FAST角点检测算法和Sobel边缘检测算法进行硬件加速,采用HLS(high-level synthesis... 为解决图像处理算法越来越复杂,普通的计算平台已满足不了当前需求的问题,根据现场可编程门阵列(field programmable gate array, FPGA)的并行计算特点对FAST角点检测算法和Sobel边缘检测算法进行硬件加速,采用HLS(high-level synthesis, HLS)高层次综合技术对两种算法进行设计并进行相应的优化。为提升系统整体性能,在FPGA上实现全部视频输入输出接口和图像算法的完整通路,通过FPGA算法电路与OpenCV算法程序进行对比,前者的图像处理速度快于后者9~11倍,系统功耗也仅为1.9 W,图像检测可达56 fps,满足实时图像处理要求,为以后设计复杂的图像处理系统提供了参考。 展开更多
关键词 现场可编程门阵列 硬件加速 高层次综合技术 图像处理 PYNQ-Z2 角点检测 边缘检测
下载PDF
上一页 1 2 25 下一页 到第
使用帮助 返回顶部