Abstract: Single instruction, multiple data (SIMD) instructions are often implemented in modern media processors. Although SIMD instructions are useful in multimedia applications, most compilers do not support them well. This paper focuses on SIMD instruction generation for media processors. We present an efficient code optimization approach that is integrated into a retargetable C compiler. SIMD instructions are generated by finding and combining identical operations in programs. Experimental results for the UltraSPARC VIS instruction set show a speedup factor of up to 2.639.
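As a rough illustration of what "combining identical operations" can mean, the sketch below packs four independent 8-bit additions into a single 32-bit word operation. It is a generic SIMD-within-a-register example, not the paper's compiler pass and not the VIS instruction set; the names `pack4`, `unpack4`, `scalar_add4`, and `packed_add4` are hypothetical.

```python
# Illustration only: four independent 8-bit additions combined into one
# 32-bit word operation. This is an assumption-based sketch, not VIS code.

LOW7 = 0x7F7F7F7F  # low 7 bits of every 8-bit lane
HIGH = 0x80808080  # top bit of every 8-bit lane


def pack4(lanes):
    """Pack four 8-bit values into one 32-bit word, lane 0 in the low byte."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * i)
    return word


def unpack4(word):
    """Split a 32-bit word back into four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def scalar_add4(a, b):
    """Four separate wrapping 8-bit additions, as unoptimized code would do."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


def packed_add4(a_word, b_word):
    """One combined addition over all four lanes, with no cross-lane carries."""
    low_sum = (a_word & LOW7) + (b_word & LOW7)  # carries stay inside each lane
    return low_sum ^ ((a_word ^ b_word) & HIGH)  # recompute the top bit per lane


if __name__ == "__main__":
    a, b = [1, 2, 3, 4], [10, 20, 30, 40]
    print(scalar_add4(a, b))
    print(unpack4(packed_add4(pack4(a), pack4(b))))
```

Both calls produce the same lanes; the packed version does the work of four additions with a handful of word-wide operations, which is the kind of saving a SIMD code generator looks for.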
Abstract: This paper analyzes the cause of bit errors when data are decided in an optical receiver. A monolithic D flip-flop (D-FF) decision circuit is designed that works effectively at 622 Mb/s. Moreover, a parallel-processing decision method is presented to improve the decision speed, through which the parallel circuit can work at up to 1 Gb/s using the same model. With this technique, higher-speed data can be decided using lower-speed devices.
Funding: Project supported by the Hi-Tech Research and Development Program (863) of China (No. 2002 AA1Z1140) and the Fok Ying Tung Education Foundation (No. 94031), China
Abstract: The 32-bit extensible embedded processor RISC3200, which originates from an RTL prototype core, is intended for low-cost consumer multimedia products. To incorporate the reduced instruction set and the multimedia extension instruction set in a unified pipeline, a scalable super-pipeline technique is adopted. Several other optimization techniques are proposed to boost the frequency and reduce the average CPI of the unified pipeline. Based on a data flow graph (DFG) with delay information, the critical path of a pipeline stage can be located and shortened. This paper presents a distributed data bypass unit and a centralized pipeline control scheme for achieving lower CPI. Synthesis and simulation showed that the optimization techniques enable RISC3200 to operate at 200 MHz with an average CPI of 1.16. The core was integrated into a media SoC taped out in SMIC 0.18-micron technology. Preliminary test results showed that the processor works as expected.
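To put the reported figures in context, the short sketch below applies the standard relation between clock frequency, CPI, and instruction throughput. Only the 200 MHz and 1.16 values come from the abstract; the function names and the one-billion-instruction workload are assumptions for illustration.

```python
def mips(clock_hz, cpi):
    """Millions of instructions per second for a given clock and average CPI."""
    return clock_hz / cpi / 1e6


def execution_time_s(instruction_count, clock_hz, cpi):
    """Execution time in seconds: instructions * cycles per instruction / clock."""
    return instruction_count * cpi / clock_hz


if __name__ == "__main__":
    # 200 MHz and CPI 1.16 are the values reported in the abstract.
    print(round(mips(200e6, 1.16), 1))
    # Hypothetical workload of one billion instructions.
    print(execution_time_s(1_000_000_000, 200e6, 1.16))
```

Under these figures the core sustains roughly 172 million instructions per second, so every reduction in average CPI translates directly into throughput at a fixed clock.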
Abstract: This paper studies the expected performance behaviour of the present 3-level cache system for multi-core systems. A queuing model of this cache system for multi-core processors is developed, and its performance is analyzed as the number of cores increases. Important performance parameters are determined, such as the access time and utilization of the individual caches at each level and the overall average access time of the cache system. Results for up to 1024 cores are reported.
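The abstract does not give the details of its queuing model, so the following is only a minimal sketch under stated assumptions: each cache level is treated as an M/M/1 queue, L1 and L2 are private to a core, and L3 is shared by all cores. All function names, parameters, and rates are hypothetical and do not come from the paper.

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean time in an M/M/1 queue (waiting plus service)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


def cache_hierarchy(cores, rate_per_core, miss_l1, miss_l2, miss_l3,
                    mu_l1, mu_l2, mu_l3, memory_time):
    """Utilization per level and overall average access time.

    Assumptions (not from the paper): private L1 and L2, one shared L3,
    each level modelled as an independent M/M/1 server.
    """
    lam_l1 = rate_per_core                               # every access hits L1 first
    lam_l2 = rate_per_core * miss_l1                     # only L1 misses reach L2
    lam_l3 = cores * rate_per_core * miss_l1 * miss_l2   # all cores share L3

    t1 = mm1_response_time(lam_l1, mu_l1)
    t2 = mm1_response_time(lam_l2, mu_l2)
    t3 = mm1_response_time(lam_l3, mu_l3)

    utilization = {"L1": lam_l1 / mu_l1, "L2": lam_l2 / mu_l2, "L3": lam_l3 / mu_l3}
    average_access_time = t1 + miss_l1 * (t2 + miss_l2 * (t3 + miss_l3 * memory_time))
    return utilization, average_access_time


if __name__ == "__main__":
    for n in (4, 8, 16):
        util, avg = cache_hierarchy(n, 1.0, 0.1, 0.5, 0.2, 4.0, 2.0, 1.0, 50.0)
        print(n, util["L3"], avg)
```

In this simplified setting the shared L3 is the only level whose load grows with the core count, so its utilization and the overall average access time both rise as cores are added.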
Abstract: Exploiting the high-speed computing and real-time processing capability of the TMS320F2812 DSP, a wavelet thresholding denoising algorithm is implemented on a digital signal processor. Based on the multi-resolution analysis of the wavelet transform, this paper proposes a new thresholding function that, to some extent, overcomes the discontinuity of the hard-thresholding function and the bias of the soft-thresholding function. An adaptive threshold algorithm obtains the threshold value according to the characteristics of the wavelet coefficients of each layer, and the noise is then removed. Simulation results show that the improved thresholding function and the adaptive threshold algorithm denoise effectively and meet the criteria of smoothness and similarity between the original signal and the denoised signal.
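For reference, the two classical thresholding functions that the abstract contrasts can be sketched as follows. These are the standard textbook definitions; the paper's new thresholding function is not specified in the abstract and is not reproduced here. The function names are illustrative.

```python
def hard_threshold(w, t):
    """Keep a coefficient unchanged if its magnitude exceeds t, else zero it.

    The output jumps from 0 to about t at |w| = t, which is the discontinuity.
    """
    return w if abs(w) > t else 0.0


def soft_threshold(w, t):
    """Zero small coefficients and shrink the rest toward zero by t.

    Every surviving coefficient is reduced by t, which is the constant bias.
    """
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


if __name__ == "__main__":
    coefficients = [-3.0, -0.5, 0.5, 1.5]
    print([hard_threshold(w, 1.0) for w in coefficients])
    print([soft_threshold(w, 1.0) for w in coefficients])
```

A compromise function of the kind the abstract describes would aim to be continuous at the threshold like the soft rule while approaching the identity for large coefficients like the hard rule.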
Abstract: Real-time task scheduling is of primary significance in multiprocessor systems. Meeting deadlines and achieving high system utilization are the two main objectives of task scheduling in such systems. In this paper, we represent those two goals as the minimization of the average response time and the average task laxity. To achieve this, we propose a genetic-based algorithm with problem-specific and efficient genetic operators. Adaptive control parameters are also employed to improve the genetic algorithm's efficiency. Simulation results show that the proposed algorithm outperforms its counterpart considerably, by up to 36% and 35% in terms of the average response time and the average task laxity, respectively.
Funding: Supported by the National High-tech R&D Program of China (863 Program) (Grant No. 2015AA0156-03) and the National Natural Science Foundation of China (Grant No. 61202483)
Abstract: The demand for programmability has become more and more pressing as novel network services appear, such as e-commerce, social software, and online video. Commodity multi-core CPUs have been widely applied in network packet processing to achieve high programmability and reduce the time-to-market. However, there is a great gap between the packet processing performance of commodity multi-core CPUs and that of traditional packet processing hardware, e.g., the NP (Network Processor). Recently, optimizing the packet processing performance of commodity multi-core CPUs has become a hot topic in industry and academia. In this paper, based on a detailed analysis of the packet processing procedure, we first identify two dominating overheads, namely virtual-to-physical address translation and packet buffer management. Second, we make a comprehensive survey of the current optimization methods. Third, based on the survey, a heterogeneous architecture of commodity multi-core CPU + FPGA is proposed as a promising way to improve packet processing performance. Fourth, a novel Self-Described Buffer (SDB) management technology is introduced to eliminate the overheads of allocating and deallocating packet buffers, which are offloaded to the FPGA. Then, an evaluation testbed named PIOT (Packet I/O Testbed) is designed and implemented to evaluate packet forwarding performance. The I/O capacity of different commodity multi-core CPUs and the performance of the optimization methods are assessed and compared on PIOT. Finally, future work on packet processing optimization on multi-core CPUs is discussed.
Funding: The National Natural Science Foundation of China (No. 60573031) and the New Century Excellent Talent Program of the Education Ministry of China (No. NCET-05-0398)
Abstract: Precise zero-knowledge was introduced by Micali and Pass at STOC'06. This notion captures the idea that the view of a verifier can be reconstructed in almost the same time. Following this notion, they constructed some precise zero-knowledge proofs and arguments in which the communicated messages are of polynomial length. In this paper, we employ the new simulation technique they introduced to provide a precise simulator for a modified version of Kilian's zero-knowledge arguments with poly-logarithmic efficiency (this modification was addressed by Rosen), and as a result we show that this protocol is a precise zero-knowledge argument with poly-logarithmic efficiency. We also present an alternative construction of the desired protocols.
Abstract: We present novel vector permutation and branch reduction methods to minimize the number of execution cycles for bit reversal algorithms. The new methods are applied to a single instruction, multiple data (SIMD) parallel implementation of the complex-data floating-point fast Fourier transform (FFT). Compared with conventional implementations, the number of operational clock cycles can be reduced by an average factor of 3.5 using our vector permutation methods and by 1.1 using our branch reduction methods. Experiments on the MPC7448 (a well-known SIMD reduced instruction set computing processor) demonstrate that our optimal bit-reversal algorithm consistently takes fewer than two cycles per element in complex array operations.
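For readers unfamiliar with the reordering step being optimized, the sketch below shows a plain scalar bit-reversal permutation of the kind used before or after a radix-2 FFT. It is a conventional baseline written for clarity, not the paper's vector permutation or branch reduction method; the function names are illustrative.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversal_permute(data):
    """Return a copy of `data` reordered by bit-reversed index.

    The length must be a power of two, as required by a radix-2 FFT.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    out = list(data)
    for i in range(n):
        j = bit_reverse(i, bits)
        if i < j:  # swap each pair once; this branch is taken for only some i
            out[i], out[j] = out[j], out[i]
    return out


if __name__ == "__main__":
    print(bit_reversal_permute(list(range(8))))
```

The per-element loop and the data-dependent `if i < j` branch in this baseline are the kinds of costs that vectorized permutation and branch reduction aim to remove.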