Tree search is a widely used fundamental algorithm. Modern processors provide tremendous computing power by integrating multiple cores, each with a vector processing unit. This paper reviews some studies on exploiting...Tree search is a widely used fundamental algorithm. Modern processors provide tremendous computing power by integrating multiple cores, each with a vector processing unit. This paper reviews some studies on exploiting single instruction multiple date (SIMD) capacity of processors to improve the performance of tree search, and proposes several improvement methods on reported SIMD tree search algorithms. Based on blocking tree structure, blocking for memory alignment and dynamic blocking prefetch are proposed to optimize the overhead of memory access. Furthermore, as a way of non-linear loop unrolling, the search branch unwinding shows that the number of branches can exceed the data width of SIMD instructions in the SIMD search algorithm. The experiments suggest that blocking optimized SIMD tree search algorithm can achieve 1.6 times response speed faster than the un-optimized algorithm.展开更多
In shared-memory bus-based multiprocessors, when the number of processors grows, the processors spend an increasing amount of time waiting for access to the bus (and shared memory). This contention reduces the perform...In shared-memory bus-based multiprocessors, when the number of processors grows, the processors spend an increasing amount of time waiting for access to the bus (and shared memory). This contention reduces the performance of processors and imposes a limitation of the number of processors that can be used efficiently in bus-based systems. Since the multi-processor’s performance depends upon many parameters which affect the performance in different ways, timed Petri nets are used to model shared-memory bus-based multiprocessors at the instruction execution level, and the developed models are used to study how the performance of processors changes with the number of processors in the system. The results illustrate very well the restriction on the number of processors imposed by the shared bus. All performance characteristics presented in this paper are obtained by discrete-event simulation of Petri net models.展开更多
Let A be m by n matrix, M and N be positive definite matrices of order in and n respectively. This paper presents an efficient method for computing (M-N) singular value decomposition((M-N) SVD) of A on a cube connecte...Let A be m by n matrix, M and N be positive definite matrices of order in and n respectively. This paper presents an efficient method for computing (M-N) singular value decomposition((M-N) SVD) of A on a cube connected single instruction stream-multiple data stream(SIMD) parallel computer. This method is based on a one-sided orthogonalization algorithm due to Hestenes. On the cube connected SIMD parallel computer with o(n) processors, the (M -- N) SVD of a matrix A requires a computation time of o(m3 log m/n).展开更多
This letter presents a programmable single-chip architecture for Multi-lnput and Multi-Output (M1MO) OFDM baseband receiver. The architecture comprises a Single Instruction Multiple Data (SIMD) DSP core and three ...This letter presents a programmable single-chip architecture for Multi-lnput and Multi-Output (M1MO) OFDM baseband receiver. The architecture comprises a Single Instruction Multiple Data (SIMD) DSP core and three coprocessors that are used for synchronization, FFT and channel decoder. In this MIMO OFDM system, the Zero Correlation Zone (ZCZ) code is used as the synchronization word preamble of packet in the physical layer in order to avoid the interference from other transmitting antennas. Furthermore, a simple channel estimation algorithm is proposed which is appropriate tbr the SIMD DSP computation.展开更多
Single instruction multiple data (SIMD) instructions are often implemented in modem media processors. Although SIMD instructions are useful in multimedia applications, most compilers do not have good support for SIM...Single instruction multiple data (SIMD) instructions are often implemented in modem media processors. Although SIMD instructions are useful in multimedia applications, most compilers do not have good support for SIMD instructions. This paper focuses on SIMD instructions generation for media processors. We present an efficient code optimization approach that is integrated into a retargetable C compiler. SIMD instructions are generated by finding and combining the same operations in programs. Experimental results for the UltraSPARC VIS instruction set show that a speedup factor up to 2.639 is obtained.展开更多
基金Project supported by the Shanghai Leading Academic Discipline Project(Grant No.J50103)the Graduate Student Innovation Foundation of Shanghai University(Grant No.SHUCX112167)
文摘Tree search is a widely used fundamental algorithm. Modern processors provide tremendous computing power by integrating multiple cores, each with a vector processing unit. This paper reviews some studies on exploiting single instruction multiple date (SIMD) capacity of processors to improve the performance of tree search, and proposes several improvement methods on reported SIMD tree search algorithms. Based on blocking tree structure, blocking for memory alignment and dynamic blocking prefetch are proposed to optimize the overhead of memory access. Furthermore, as a way of non-linear loop unrolling, the search branch unwinding shows that the number of branches can exceed the data width of SIMD instructions in the SIMD search algorithm. The experiments suggest that blocking optimized SIMD tree search algorithm can achieve 1.6 times response speed faster than the un-optimized algorithm.
文摘In shared-memory bus-based multiprocessors, when the number of processors grows, the processors spend an increasing amount of time waiting for access to the bus (and shared memory). This contention reduces the performance of processors and imposes a limitation of the number of processors that can be used efficiently in bus-based systems. Since the multi-processor’s performance depends upon many parameters which affect the performance in different ways, timed Petri nets are used to model shared-memory bus-based multiprocessors at the instruction execution level, and the developed models are used to study how the performance of processors changes with the number of processors in the system. The results illustrate very well the restriction on the number of processors imposed by the shared bus. All performance characteristics presented in this paper are obtained by discrete-event simulation of Petri net models.
文摘Let A be m by n matrix, M and N be positive definite matrices of order in and n respectively. This paper presents an efficient method for computing (M-N) singular value decomposition((M-N) SVD) of A on a cube connected single instruction stream-multiple data stream(SIMD) parallel computer. This method is based on a one-sided orthogonalization algorithm due to Hestenes. On the cube connected SIMD parallel computer with o(n) processors, the (M -- N) SVD of a matrix A requires a computation time of o(m3 log m/n).
基金Supported by the National Natural Science Foundation of China (No.60476013).
文摘This letter presents a programmable single-chip architecture for Multi-lnput and Multi-Output (M1MO) OFDM baseband receiver. The architecture comprises a Single Instruction Multiple Data (SIMD) DSP core and three coprocessors that are used for synchronization, FFT and channel decoder. In this MIMO OFDM system, the Zero Correlation Zone (ZCZ) code is used as the synchronization word preamble of packet in the physical layer in order to avoid the interference from other transmitting antennas. Furthermore, a simple channel estimation algorithm is proposed which is appropriate tbr the SIMD DSP computation.
文摘Single instruction multiple data (SIMD) instructions are often implemented in modem media processors. Although SIMD instructions are useful in multimedia applications, most compilers do not have good support for SIMD instructions. This paper focuses on SIMD instructions generation for media processors. We present an efficient code optimization approach that is integrated into a retargetable C compiler. SIMD instructions are generated by finding and combining the same operations in programs. Experimental results for the UltraSPARC VIS instruction set show that a speedup factor up to 2.639 is obtained.