模型深度的不断增加和处理序列长度的不一致对循环神经网络在不同处理器上的性能优化提出巨大挑战。针对自主研制的长向量处理器FT-M7032,实现了一个高效的循环神经网络加速引擎。该引擎采用行优先矩阵向量乘算法和数据感知的多核并行方...模型深度的不断增加和处理序列长度的不一致对循环神经网络在不同处理器上的性能优化提出巨大挑战。针对自主研制的长向量处理器FT-M7032,实现了一个高效的循环神经网络加速引擎。该引擎采用行优先矩阵向量乘算法和数据感知的多核并行方式,提高矩阵向量乘的计算效率;采用两级内核融合优化方法降低临时数据传输的开销;采用手写汇编优化多种算子,进一步挖掘长向量处理器的性能潜力。实验表明,长向量处理器循环神经网络推理引擎可获得较高性能,相较于多核ARM CPU以及Intel Golden CPU,类循环神经网络模型长短记忆网络可获得最高62.68倍和3.12倍的性能加速。展开更多
Due to the fact that the register files seriously affect the performance and area of coarse-grained reconfigurable cryptographic processors, an efficient structure of the distributed cross-domain register file is prop...Due to the fact that the register files seriously affect the performance and area of coarse-grained reconfigurable cryptographic processors, an efficient structure of the distributed cross-domain register file is proposed to realize a cryptographic processor with a high performance and a lowarea cost. In order to meet the demands of high performance and high flexibility at a lowarea cost, a union structure with the multi-ports access structure, i, e., a distributed crossdomain register file, is designed by analyzing the algorithm features of different ciphers. Considering different algorithm requirements of the global register files and local register files,the circuit design is realized by adopting different design parameters under TSMC( Taiwan Semiconductor Manufacturing Company) 40 nm CMOS( complementary metal oxide semiconductor) technology and compared with other similar works. The experimental results showthat the proposed distributed cross-domain register structure can effectively improve the performance of the unit area, of which the total performance of block per cycle is improved by17. 79% and performance of block per cycle per area is improved by 117%.展开更多
Nonlinear multisplitting method is known as parallel iterative methods for solving a large-scale system of nonlinear equations F(x) = 0. We extend the idea of nonlinear multisplitting and consider a new model ill whic...Nonlinear multisplitting method is known as parallel iterative methods for solving a large-scale system of nonlinear equations F(x) = 0. We extend the idea of nonlinear multisplitting and consider a new model ill which the iteration is executed asynchronously: Each processor calculate the solution of an individual nonlinear system belong to its nonlinear multisplitting and can update the global approximation residing in the shared memory at any time. A local convergence analysis of this model is presented. Finally, we give a uumerical example which shows a 'strange' property that speedup Sp > p and efficiency Ep > 1.展开更多
矩阵转置是矩阵运算的基本操作,广泛应用于信号处理、科学计算以及深度学习等各种领域。随着国防科技大学自主研制的飞腾异构多核数字信号处理器(digital signal processor, DSP)在各种领域中的推广应用,对高性能矩阵转置实现提出了强...矩阵转置是矩阵运算的基本操作,广泛应用于信号处理、科学计算以及深度学习等各种领域。随着国防科技大学自主研制的飞腾异构多核数字信号处理器(digital signal processor, DSP)在各种领域中的推广应用,对高性能矩阵转置实现提出了强烈需求。针对飞腾异构多核DSP的体系结构特征与矩阵转置操作的特点,提出了一种适配不同数据位宽(8 B、4 B以及2 B)矩阵的并行矩阵转置算法ftmMT。该算法基于DSP中向量处理单元的Load/Store部件实现了向量化,同时基于矩阵分块实现了多个DSP核的并行处理,通过隐式乒乓设计实现了片上向量化转置与片外访存的重叠以及访存性能的大幅提升。实验结果表明,ftmMT能够显著加快矩阵转置操作,与CPU上的开源转置库HPTT相比,可获得高达8.99倍的性能加速。展开更多
基于异构众核处理器的超级计算机已经成为TOP500高性能计算机的主流,BSP、LogP、PRAM等已有并行计算模型均针对基于多核处理器的超级计算机设计,不能满足日益迫切的基于众核架构的超级计算机和应用发展需求.本文面向“神威·太湖之...基于异构众核处理器的超级计算机已经成为TOP500高性能计算机的主流,BSP、LogP、PRAM等已有并行计算模型均针对基于多核处理器的超级计算机设计,不能满足日益迫切的基于众核架构的超级计算机和应用发展需求.本文面向“神威·太湖之光”和神威E级原型系统的众核体系结构特点,提出P-PALN(Parallel-Parallel Access via LDM&NOC)并行计算模型,对于计算节点间的并行,该模型沿用BSP/LogP模型描述;对于计算节点内的众核并行,该模型提供私有存储访问和片上阵列通信的众核并行架构的有效描述PALN,能够协助用户进行众核并行算法设计,并在申威众核处理器硬件设计中指导参数的优化.实验结果表明,该模型可有效指导硬件设计和用户众核编程,从而提高系统和应用的性能.展开更多
三维高效视频编码(3D High Efficiency Video Coding,3D-HEVC)中视差估计算法存在处理数据量大、运算时间长和资源消耗大的问题,进一步提高算法执行效率对于3D-HEVC的推广应用具有十分重要的意义。在深入分析视差估计算法的并行性的基础...三维高效视频编码(3D High Efficiency Video Coding,3D-HEVC)中视差估计算法存在处理数据量大、运算时间长和资源消耗大的问题,进一步提高算法执行效率对于3D-HEVC的推广应用具有十分重要的意义。在深入分析视差估计算法的并行性的基础上,基于项目组开发的视频阵列处理器(DPR-CODEC),提出一种新的并行实现方案。在可重构阵列结构中完成了视差估计算法的并行映射、功能仿真和FPGA测试,显著减少了视差估计算法的执行时间。实验结果表明,所提出的并行实现方案相比于串行单PE执行时间节省了59%,基于可编程可重构阵列的并行实现在具有较高的执行效率的同时也具有较好的灵活性。展开更多
文摘模型深度的不断增加和处理序列长度的不一致对循环神经网络在不同处理器上的性能优化提出巨大挑战。针对自主研制的长向量处理器FT-M7032,实现了一个高效的循环神经网络加速引擎。该引擎采用行优先矩阵向量乘算法和数据感知的多核并行方式,提高矩阵向量乘的计算效率;采用两级内核融合优化方法降低临时数据传输的开销;采用手写汇编优化多种算子,进一步挖掘长向量处理器的性能潜力。实验表明,长向量处理器循环神经网络推理引擎可获得较高性能,相较于多核ARM CPU以及Intel Golden CPU,类循环神经网络模型长短记忆网络可获得最高62.68倍和3.12倍的性能加速。
基金The National Natural Science Foundation of China(No.61176024)
文摘Due to the fact that the register files seriously affect the performance and area of coarse-grained reconfigurable cryptographic processors, an efficient structure of the distributed cross-domain register file is proposed to realize a cryptographic processor with a high performance and a lowarea cost. In order to meet the demands of high performance and high flexibility at a lowarea cost, a union structure with the multi-ports access structure, i, e., a distributed crossdomain register file, is designed by analyzing the algorithm features of different ciphers. Considering different algorithm requirements of the global register files and local register files,the circuit design is realized by adopting different design parameters under TSMC( Taiwan Semiconductor Manufacturing Company) 40 nm CMOS( complementary metal oxide semiconductor) technology and compared with other similar works. The experimental results showthat the proposed distributed cross-domain register structure can effectively improve the performance of the unit area, of which the total performance of block per cycle is improved by17. 79% and performance of block per cycle per area is improved by 117%.
文摘Nonlinear multisplitting method is known as parallel iterative methods for solving a large-scale system of nonlinear equations F(x) = 0. We extend the idea of nonlinear multisplitting and consider a new model ill which the iteration is executed asynchronously: Each processor calculate the solution of an individual nonlinear system belong to its nonlinear multisplitting and can update the global approximation residing in the shared memory at any time. A local convergence analysis of this model is presented. Finally, we give a uumerical example which shows a 'strange' property that speedup Sp > p and efficiency Ep > 1.
文摘基于异构众核处理器的超级计算机已经成为TOP500高性能计算机的主流,BSP、LogP、PRAM等已有并行计算模型均针对基于多核处理器的超级计算机设计,不能满足日益迫切的基于众核架构的超级计算机和应用发展需求.本文面向“神威·太湖之光”和神威E级原型系统的众核体系结构特点,提出P-PALN(Parallel-Parallel Access via LDM&NOC)并行计算模型,对于计算节点间的并行,该模型沿用BSP/LogP模型描述;对于计算节点内的众核并行,该模型提供私有存储访问和片上阵列通信的众核并行架构的有效描述PALN,能够协助用户进行众核并行算法设计,并在申威众核处理器硬件设计中指导参数的优化.实验结果表明,该模型可有效指导硬件设计和用户众核编程,从而提高系统和应用的性能.
文摘三维高效视频编码(3D High Efficiency Video Coding,3D-HEVC)中视差估计算法存在处理数据量大、运算时间长和资源消耗大的问题,进一步提高算法执行效率对于3D-HEVC的推广应用具有十分重要的意义。在深入分析视差估计算法的并行性的基础上,基于项目组开发的视频阵列处理器(DPR-CODEC),提出一种新的并行实现方案。在可重构阵列结构中完成了视差估计算法的并行映射、功能仿真和FPGA测试,显著减少了视差估计算法的执行时间。实验结果表明,所提出的并行实现方案相比于串行单PE执行时间节省了59%,基于可编程可重构阵列的并行实现在具有较高的执行效率的同时也具有较好的灵活性。