12 articles found
Blocking optimized SIMD tree search on modern processors (cited by: 2)
1
Authors: 张倬, 陆宇凡, 沈文枫, 徐炜民, 郑衍衡. Journal of Shanghai University (English Edition) (CAS), 2011, Issue 5, pp. 437-444 (8 pages)
Tree search is a widely used fundamental algorithm. Modern processors provide tremendous computing power by integrating multiple cores, each with a vector processing unit. This paper reviews some studies on exploiting the single instruction multiple data (SIMD) capability of processors to improve the performance of tree search, and proposes several improvements to reported SIMD tree search algorithms. Based on a blocking tree structure, blocking for memory alignment and dynamic blocking prefetch are proposed to reduce the overhead of memory access. Furthermore, as a form of non-linear loop unrolling, search branch unwinding shows that the number of branches can exceed the data width of SIMD instructions in the SIMD search algorithm. The experiments suggest that the blocking optimized SIMD tree search algorithm can respond 1.6 times faster than the unoptimized algorithm.
Keywords: single instruction multiple data (SIMD); tree search; binary search; streaming SIMD extensions (SSE); Cell broadband engine (BE)
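The idea of comparing a key against several separators in one step, which the abstract above builds on, can be sketched roughly as follows. This is not the paper's blocked tree layout: it is a plain multi-way search over a sorted array, with a NumPy vector comparison standing in for a SIMD compare. The function name `kary_lower_bound` and the parameter `lanes` are hypothetical, chosen only for illustration.

```python
import numpy as np

def kary_lower_bound(sorted_keys, key, lanes=4):
    """Return the first index i with sorted_keys[i] >= key (len if none).

    Illustrative sketch only: each loop iteration compares `key` against up
    to `lanes` evenly spaced separators at once, as a SIMD compare would.
    """
    a = np.asarray(sorted_keys)
    lo, hi = 0, len(a)                  # invariant: the answer lies in [lo, hi]
    while lo < hi:
        n = hi - lo
        step = -(-n // lanes)           # ceil(n / lanes), always >= 1
        seps = np.arange(lo + step - 1, hi, step)   # separator positions
        # One vector comparison; the count tells which sub-range to enter.
        less = int(np.count_nonzero(a[seps] < key))
        if less < len(seps):
            hi = int(seps[less])        # a[seps[less]] >= key bounds the answer
        lo = lo + less * step           # everything before this is < key
    return lo
```

With `lanes` separators per step the range shrinks by a factor of roughly `lanes + 1` per iteration instead of 2, which is the motivation for wider SIMD comparisons; the memory-alignment and prefetch blocking described in the abstract are not modeled here.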
ALGORITHMS AND ARCHITECTURE IMPLEMENTATIONS OF MIMO OFDM BASEBAND RECEIVER BASED ON THE SIMD DSP CORE (cited by: 1)
2
Authors: Hao Xuefei, Chen Jie, Zhao Danfeng, Zhou Chaoxian. Journal of Electronics (China), 2006, Issue 5, pp. 763-768 (6 pages)
This letter presents a programmable single-chip architecture for a Multi-Input and Multi-Output (MIMO) OFDM baseband receiver. The architecture comprises a Single Instruction Multiple Data (SIMD) DSP core and three coprocessors that are used for synchronization, FFT and channel decoding. In this MIMO OFDM system, the Zero Correlation Zone (ZCZ) code is used as the synchronization word preamble of the packet in the physical layer in order to avoid interference from other transmitting antennas. Furthermore, a simple channel estimation algorithm is proposed that is appropriate for SIMD DSP computation.
Keywords: Multi-Input and Multi-Output (MIMO); OFDM; Baseband receiver; Zero Correlation Zone (ZCZ) code; Single Instruction Multiple Data (SIMD); DSP
Efficient SIMD optimization for media processors
3
Authors: Jian-peng ZHOU, Ce SHI. Journal of Zhejiang University-Science A (Applied Physics & Engineering) (SCIE, EI, CAS, CSCD), 2008, Issue 4, pp. 524-530 (7 pages)
Single instruction multiple data (SIMD) instructions are often implemented in modern media processors. Although SIMD instructions are useful in multimedia applications, most compilers do not have good support for them. This paper focuses on SIMD instruction generation for media processors. We present an efficient code optimization approach that is integrated into a retargetable C compiler. SIMD instructions are generated by finding and combining identical operations in programs. Experimental results for the UltraSPARC VIS instruction set show that a speedup factor of up to 2.639 is obtained.
Keywords: Retargetable compiler; Single instruction multiple data (SIMD) instruction; LCC
A TSE based design for MMSE and QRD of MIMO systems based on ASIP
4
Authors: 冯雪林, SHI Jinglin, CHEN Yang, FU Yanlu, ZHANG Qineng, XIAO Feng. High Technology Letters (EI, CAS), 2023, Issue 2, pp. 166-173 (8 pages)
A Taylor series expansion (TSE) based design for minimum mean-square error (MMSE) and QR decomposition (QRD) of multi-input and multi-output (MIMO) systems is proposed based on an application specific instruction set processor (ASIP), which uses the TSE algorithm instead of resource-consuming reciprocal and reciprocal square root (RSR) operations. The aim is to give a high performance implementation of both MMSE and QRD on one programmable platform. Furthermore, the instruction set architecture (ISA) and the allocation of data paths in a single instruction multiple data-very long instruction word (SIMD-VLIW) architecture are provided, offering more data parallelism and instruction parallelism for different matrix dimensions and operation types. Meanwhile, multiple levels of numerical precision can be achieved with flexible table size and expansion order in the TSE ISA. The ASIP has been implemented in a 28 nm CMOS process and its frequency reaches 800 MHz. Experimental results show that the proposed design provides perfect numerical precision within the fixed bit-width of the ASIP, a matrix processing rate exceeding the requirements of 5G systems, and rate-area efficiency comparable with ASIC implementations.
Keywords: multi-input and multi-output (MIMO); minimum mean-square error (MMSE); QR decomposition (QRD); Taylor series expansion (TSE); application specific instruction set processor (ASIP); instruction set architecture (ISA); single instruction multiple data (SIMD); very long instruction word (VLIW)
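The trade-off between table size and expansion order mentioned in the abstract above can be illustrated with a small floating-point sketch of a table-plus-Taylor reciprocal. Everything here is an assumption made for illustration: the input is taken to be normalized to [1, 2), the table holds 1/x0 at the centre of each sub-interval, and the names `make_table` and `reciprocal_tse` are hypothetical. The abstract does not describe the ASIP's fixed-point datapath, and none of it is modeled here.

```python
def make_table(size):
    """Reciprocals of the centres of `size` equal sub-intervals of [1, 2)."""
    return [1.0 / (1.0 + (i + 0.5) / size) for i in range(size)]

def reciprocal_tse(x, table, order):
    """Approximate 1/x for x in [1, 2) via a truncated Taylor series.

    Uses 1/x = (1/x0) * sum_{n>=0} (-(x - x0)/x0)^n around the table
    point x0, truncated after `order` terms beyond the constant one.
    """
    if not 1.0 <= x < 2.0:
        raise ValueError("x must be normalized to [1, 2)")
    size = len(table)
    i = int((x - 1.0) * size)          # index of the sub-interval holding x
    x0 = 1.0 + (i + 0.5) / size        # expansion point (interval centre)
    r0 = table[i]                      # looked-up value of 1/x0
    t = -(x - x0) * r0                 # series ratio, |t| <= 1/(2*size)
    acc, term = 1.0, 1.0
    for _ in range(order):
        term *= t
        acc += term
    return r0 * acc
```

In this sketch a larger table shrinks |t| while a higher order adds correction terms, so either knob tightens the error, for example `reciprocal_tse(1.3, make_table(16), 2)` versus order 3.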
Combining Task Scheduling in Power Adaptive Dynamic Reconfigurable System (cited by: 2)
5
Authors: Hui Dong, Le-Tian Huang, Jun-Shi Wang, Terrence Mak. Journal of Electronic Science and Technology (CAS), 2012, Issue 4, pp. 296-301 (6 pages)
Powering electronic equipment from ambient energy sources is a hot topic. To match power supply and demand in real time as the environment varies, a reconfigurable technique is adopted. In this paper, a dynamic power consumption model that uses a lookup table as its unit is proposed. Then, we establish a system-level task scheduling model according to the task type. Based on a single instruction multiple data (SIMD) architecture, which contains a processing system and a control system with a Nios II processor, a practical dynamic reconfigurable system is built. The approach is evaluated on a hardware platform. The test results show that the system can automatically adjust its power consumption when the external energy input changes. The utilization of the system's dynamic power ranges from 80.05% to 91.75% during the first task assignment. Over the entire processing cycle, the total energy efficiency is 97.67%.
Keywords: Nios II; power adaptive; reconfiguration; single instruction multiple data (SIMD); task scheduling model
A parallel memory architecture for video coding
6
Authors: Jian-ying PENG, Xiao-lang YAN, De-xian LI, Li-zhong CHEN. Journal of Zhejiang University-Science A (Applied Physics & Engineering) (SCIE, EI, CAS, CSCD), 2008, Issue 12, pp. 1644-1655 (12 pages)
To efficiently exploit the performance of single instruction multiple data (SIMD) architectures for video coding, a parallel memory architecture with power-of-two memory modules is proposed. It employs two novel skewing schemes to provide conflict-free access to adjacent elements (8-bit and 16-bit data types) or to elements at power-of-two intervals in both horizontal and vertical directions, which was not possible in previous parallel memory architectures. Area consumption and delay estimates are given for 4, 8 and 16 memory modules. Under a 0.18-μm CMOS technology, the synthesis results show that the proposed system can achieve a 230 MHz clock frequency with 16 memory modules at the cost of 19k gates when read and write latencies are 3 and 2 clock cycles, respectively. We implement the proposed parallel memory architecture on a video signal processor (VSP). The results show that the VSP enhanced with the proposed architecture achieves a 1.28× speedup for H.264 real-time decoding.
Keywords: Single instruction multiple data (SIMD); Video coding; Parallel memory; Skewing scheme
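For readers unfamiliar with skewing, the general idea of mapping a 2-D array onto memory modules so that a row or a column can be fetched without two elements hitting the same module can be sketched with the classical linear skew. This is a textbook scheme used purely for illustration, not either of the two schemes the abstract proposes; `module_of` and `conflict_free` are hypothetical names.

```python
def module_of(x, y, n_modules):
    """Classical linear skew: element (x, y) lives in module (x + y) mod N.

    With N a power of two the modulo reduces to a bit mask in hardware.
    """
    return (x + y) % n_modules

def conflict_free(cells, n_modules, mapping=module_of):
    """True if no two of the given (x, y) cells map to the same module."""
    modules = [mapping(x, y, n_modules) for x, y in cells]
    return len(set(modules)) == len(modules)

N = 8
row = [(x, 3) for x in range(N)]          # 8 horizontally adjacent elements
col = [(5, y) for y in range(N)]          # 8 vertically adjacent elements
strided = [(2 * i, 0) for i in range(N)]  # horizontal access with interval 2

unskewed = lambda x, y, n: x % n          # naive mapping, for comparison
```

Under this mapping both `row` and `col` are conflict-free, whereas the unskewed mapping puts the whole column in one module. The simple skew does not handle `strided` (eight elements land on only four modules), which is the kind of power-of-two-interval access the abstract says its schemes are designed to support.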
Hardware-Software Co-implementation of H.264 Decoder in SoC
7
Authors: 杨宇红, 张文军, 熊恋学, 饶振宁. Journal of Shanghai Jiaotong University (Science) (EI), 2006, Issue 3, pp. 335-339 (5 pages)
With the increasing demand for flexible and efficient implementation of image and video processing algorithms, a good tradeoff between hardware and software design methods is needed. This paper uses the HW-SW codesign method to implement the H.264 decoder in an SoC with an ARM core, a multimedia processor and a deblocking filter coprocessor. Owing to the parallel processing features of the multimedia processor, the clock cycles of the decoding process can be dramatically reduced. The dedicated hardware deblocking filter coprocessor further improves efficiency. With a maximum clock frequency of 150 MHz, the whole system achieves real-time processing speed and flexibility.
Keywords: HW-SW co-implementation; single instruction multiple data (SIMD); multimedia processor; H.264 decoder; coprocessor
Sorting Data Elements by SOCD Using Centralized Diamond Architecture
8
Authors: Masumeh Damrudi, Kamal Jadidy Aval. Computer Technology and Application, 2011, Issue 5, pp. 374-377 (4 pages)
Parallel sorting techniques on different architectures have been studied for many years. Given the need for faster systems, parallelism can be used to accelerate applications. Nowadays, parallel operations are used to solve computing problems such as sorting and searching at a reasonable speed. Sorting is one of the most important operations in computing. Researchers seek the best solution in several respects, the foremost being speedup. In this paper, the authors present a sort with O(log n) time complexity on PRAM EREW (Parallel Random Access Machine Exclusive Read Exclusive Write). The algorithm is designed to balance the number of processor elements in the architecture against execution time. Simulation of the algorithm confirms its theoretical analysis. The results of this research can be used in developing faster embedded systems. The Sorting on Centralized Diamond (SOCD) algorithm runs on the novel Centralized Diamond architecture, which takes advantage of the Single Instruction Multiple Data (SIMD) architecture. This architecture and the sort on it are intuitive and optimal.
Keywords: Parallel sorting; diamond architecture; single instruction multiple data (SIMD); parallel random access machine exclusive read exclusive write (PRAM EREW); sorting on centralized diamond (SOCD)
HXPY: A High-Performance Data Processing Package for Financial Time-Series Data
9
Authors: 郭家栋, 彭靖姝, 苑航, 倪明选. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2023, Issue 1, pp. 3-24 (22 pages)
A tremendous amount of data is generated by global financial markets every day, and such time-series data needs to be analyzed in real time to explore its potential value. In recent years, we have witnessed the successful adoption of machine learning models on financial data, where the importance of accuracy and timeliness demands highly effective computing frameworks. However, traditional financial time-series data processing frameworks have shown performance degradation and adaptation issues, such as outlier handling with stock suspension in Pandas and TA-Lib. In this paper, we propose HXPY, a high-performance data processing package with a C++/Python interface for financial time-series data. HXPY supports miscellaneous acceleration techniques such as the streaming algorithm, the vectorization instruction set, and memory optimization, together with various functions such as time window functions, group operations, down-sampling operations, cross-section operations, row-wise or column-wise operations, shape transformations, and alignment functions. The results of benchmark and incremental analysis demonstrate the superior performance of HXPY compared with its counterparts. From MiBs to GiBs of data, HXPY outperforms other in-memory dataframe computing rivals by up to hundreds of times.
Keywords: dataframe; time-series data; SIMD (single instruction multiple data); CUDA (Compute Unified Device Architecture)
Evaluating RISC-V Vector Instruction Set Architecture Extension with Computer Vision Workloads
10
Authors: 李若时, 彭平, 邵志远, 金海, 郑然. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2023, Issue 4, pp. 807-820 (14 pages)
Computer vision (CV) algorithms are now used extensively in a myriad of applications. As multimedia data are generally well-formatted and regular, it is beneficial to leverage the massive parallel processing power of the underlying platform to improve the performance of CV algorithms. Single Instruction Multiple Data (SIMD) instructions, capable of conducting the same operation on multiple data items in a single instruction, are extensively employed to improve the efficiency of CV algorithms. In this paper, we evaluate the power and effectiveness of the RISC-V vector extension (RV-V) on typical CV algorithms, such as Gray Scale, Mean Filter, and Edge Detection. Our examinations show that, compared with the baseline OpenCV implementation using scalar instructions, the equivalent implementations using RV-V (version 0.8) can reduce the instruction count of the same CV algorithm by up to 24x when processing the same input images. However, the actual performance improvement measured in cycle counts depends strongly on the specific implementation of the underlying RV-V co-processor. In our evaluation, using the vector co-processor (with eight execution lanes) of the Xuantie C906, vector-version CV algorithms exhibit average speedups of up to 2.98x over their scalar counterparts.
Keywords: RISC-V vector extension; single instruction multiple data (SIMD); computer vision; OpenCV
Bypass-Enabled Thread Compaction for Divergent Control Flow in Graphics Processing Units
11
Authors: LI Bingchao, WEI Jizeng, GUO Wei, SUN Jizhou. Journal of Shanghai Jiaotong University (Science) (EI), 2021, Issue 2, pp. 245-256 (12 pages)
Graphics processing units (GPUs) employ single instruction multiple data (SIMD) hardware to run threads in parallel and allow each thread to maintain an arbitrary control flow. Threads running concurrently within a warp may jump to different paths after conditional branches. Such divergent control flow leaves some lanes idle and hence reduces the SIMD utilization of GPUs. To alleviate the waste of SIMD lanes, threads from multiple warps can be collected together and compacted into idle lanes to improve SIMD lane utilization. However, this mechanism induces extra barrier synchronizations, since warps have to be stalled to wait for other warps before compaction, so that in some cases no warps are scheduled. In this paper, we propose an approach to reduce the overhead of barrier synchronizations induced by compactions. In our approach, a compaction is bypassed by warps whose threads all jump to the same path after branches. Moreover, warps waiting for a compaction can also bypass it when no warps are ready for issuing. In addition, a compaction is canceled if it cannot reduce the number of idle lanes. The experimental results demonstrate that our approach provides an average improvement of 21% over the baseline GPU for applications with massive divergent branches, while recovering 13% on average of the performance loss induced by compactions for applications with many non-divergent control flows.
Keywords: graphics processing unit (GPU); single instruction multiple data (SIMD); thread; warps; bypass
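The effect of divergence on lane utilization, and why uniform warps have nothing to gain from compaction, can be illustrated with a toy issue-slot count. This is a heavily idealized model written for this listing, not the paper's mechanism: it ignores lane-position constraints, scheduling, and barrier timing, and the function names are hypothetical. Each warp is a list of booleans giving the branch outcome of every thread.

```python
from math import ceil

def issue_slots_baseline(warps):
    """Without compaction a warp issues once per distinct path it takes."""
    return sum(len(set(w)) for w in warps)

def issue_slots_compacted(warps, width):
    """Idealized compaction: divergent warps pool their threads per path.

    Warps whose threads all agree bypass compaction and issue once.
    Assumption: threads may be packed into any lane, which real hardware
    generally does not allow.
    """
    uniform = [w for w in warps if len(set(w)) == 1]
    divergent = [w for w in warps if len(set(w)) > 1]
    taken = sum(sum(w) for w in divergent)
    not_taken = sum(len(w) - sum(w) for w in divergent)
    return len(uniform) + ceil(taken / width) + ceil(not_taken / width)

def utilization(warps, width, slots):
    """Fraction of SIMD lanes doing useful work over the issued slots."""
    threads = sum(len(w) for w in warps)
    return threads / (slots * width)

warps = [
    [True, False, False, False],   # divergent
    [False, True, True, True],     # divergent
    [True, True, True, True],      # uniform: bypasses compaction
]
base = issue_slots_baseline(warps)
comp = issue_slots_compacted(warps, width=4)
```

In this toy case the two divergent warps need four issue slots on their own but only two once their threads are regrouped by path, while the uniform warp costs one slot either way, so making it wait for a compaction could only add delay.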
Novel algorithm for complex bit reversal: employing vector permutation and branch reduction methods
12
Authors: Feng YU, Ze-ke WANG, Rui-feng GE. Journal of Zhejiang University-Science A (Applied Physics & Engineering) (SCIE, EI, CAS, CSCD), 2009, Issue 10, pp. 1492-1499 (8 pages)
We present novel vector permutation and branch reduction methods to minimize the number of execution cycles for bit reversal algorithms. The new methods are applied to a single instruction multiple data (SIMD) parallel implementation of the complex-data floating-point fast Fourier transform (FFT). Compared with conventional implementations, the number of operational clock cycles can be reduced by an average factor of 3.5 using our vector permutation methods and by 1.1 using our branch reduction methods. Experiments on the MPC7448 (a well-known SIMD reduced instruction set computing processor) demonstrate that our optimal bit-reversal algorithm consistently takes fewer than two cycles per element in complex array operations.
Keywords: Bit reversal; Vector permutation; Branch reduction; Single instruction multiple data (SIMD); Fast Fourier transform (FFT)
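As background for the abstract above, the bit-reversal permutation that an FFT applies to its input or output can be written in a few lines. This is the conventional scalar formulation, shown only to make the operation concrete; it is not the vector-permutation or branch-reduced method the paper proposes, and the function names are hypothetical.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def bit_reversal_permute(x):
    """Return a copy of x reordered so that out[i] = x[bit_reverse(i)].

    len(x) must be a power of two (at least 1).
    """
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    out = list(x)
    for i in range(n):
        j = bit_reverse(i, bits)
        if i < j:                      # swap each pair once; the branch the
            out[i], out[j] = out[j], out[i]   # abstract's methods aim to reduce
    return out
```

For example, `bit_reversal_permute(list(range(8)))` yields `[0, 4, 2, 6, 1, 5, 3, 7]`. The per-element data-dependent `if i < j` test and the scattered memory accesses are what make the scalar form costly.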