Abstract
Targeting the architectural characteristics of the BWDSP100 processor, a multicluster Very Long Instruction Word (VLIW) design, this paper presents several practical ways to improve the performance of the digital signal transformation functions in the BWDSP100 Digital Signal Processor (DSP) function library. Building on loop unrolling, instruction scheduling, and software pipelining, the methods include using special hardware instructions, zero-overhead loop instructions, instruction-level reordering, Instruction-Level Parallelism (ILP), and software vectorization. Parallel optimized versions of the library functions are implemented on the basis of the original sequential versions. Experimental results show that, in four-macro parallel mode, all functions achieve a speedup of more than 9x, 90% of the functions achieve a speedup of more than 10x, and the average speedup is 11.12x.
Source
《计算机工程》
CAS
CSCD
Peking University Core Journal
2016, Issue 3, pp. 47-52 (6 pages)
Computer Engineering
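Funding
Programme of Introducing Talents of Discipline to Universities (B07033)
Anhui Provincial Natural Science Foundation project "Research on Parallel Deployment and Optimization Strategies for Deep Neural Networks on GPU Clusters" (1408085MKL06)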
Keywords
Very Long Instruction Word(VLIW)
Single Instruction Multiple Data(SIMD)
Digital Signal Processor(DSP)
loop unrolling
parallelization
multicluster