As an emerging joint learning model,federated learning is a promising way to combine model parameters of different users for training and inference without collecting users’original data.However,a practical and effic...As an emerging joint learning model,federated learning is a promising way to combine model parameters of different users for training and inference without collecting users’original data.However,a practical and efficient solution has not been established in previous work due to the absence of efficient matrix computation and cryptography schemes in the privacy-preserving federated learning model,especially in partially homomorphic cryptosystems.In this paper,we propose a Practical and Efficient Privacy-preserving Federated Learning(PEPFL)framework.First,we present a lifted distributed ElGamal cryptosystem for federated learning,which can solve the multi-key problem in federated learning.Secondly,we develop a Practical Partially Single Instruction Multiple Data(PSIMD)parallelism scheme that can encode a plaintext matrix into single plaintext for encryption,improving the encryption efficiency and reducing the communication cost in partially homomorphic cryptosystem.In addition,based on the Convolutional Neural Network(CNN)and the designed cryptosystem,a novel privacy-preserving federated learning framework is designed by using Momentum Gradient Descent(MGD).Finally,we evaluate the security and performance of PEPFL.The experiment results demonstrate that the scheme is practicable,effective,and secure with low communication and computation costs.展开更多
An effective method is introduced to compensate the effects of mutual coupling for the Estimation of Signal Parameter via Rotational Invariance Techniques (ESPRIT) direction finding algorithm in application of signal ...An effective method is introduced to compensate the effects of mutual coupling for the Estimation of Signal Parameter via Rotational Invariance Techniques (ESPRIT) direction finding algorithm in application of signal snapshot array processing.Changing the covariance matrix into a Teoplitz matrix can achieve high resolution in the Direction Of Arrive (DOA) estimation.How the mutual coupling affects the array antennas has been discussed and a new definition of mutual im- pedance has been used to characterize the mutual coupling effects between the array elements.Based on the new mutual impedance matrix,a practical method is presented to eliminate the effects of mutual coupling for ESPRIT in the single snapshot data processing.The simulation results show that, this new method not only properly reduces the effects of mutual coupling,but also maintains its steady performance even for weak signals.展开更多
A single column model (SCM) is constructed by extracting the physical subroutines from the NCAR Community Climate Model version 1 (CCM1).Simulated data are generated by CCM1 and used to validate the SCM and to study t...A single column model (SCM) is constructed by extracting the physical subroutines from the NCAR Community Climate Model version 1 (CCM1).Simulated data are generated by CCM1 and used to validate the SCM and to study the sensitivity of the SCM to errors in its input data.It is found that the SCM temperature predictions are moderately sensitive to errors in the input horizontal temperature flux convergence and moisture flux convergence.Two types of error are concerned in this study,random errors due to insufficient data resolution,and errors due to insufficient data area coverage.While the first type of error can be reduced by filtering and/or increasing the data resolution,it is shown that the second type of error can be reduced by enlarging the data area coverage and using a suitable method to compute the input flux convergence terms.展开更多
Several parallel sorting techniques on different architectures have been studied for many years. Due to the need for faster systems in today's world, parallelism can be used to accelerate applications. Nowadays, para...Several parallel sorting techniques on different architectures have been studied for many years. Due to the need for faster systems in today's world, parallelism can be used to accelerate applications. Nowadays, parallel operations are used to solve computer problems such as sort and search, which result in a reasonable speed. Sorting is one of the most important operations in computing world. The authors always try to find the best in different areas which the premier is speedup. In this paper, the authors issued a sort with O(logn) time complexity on PRAM EREW (Parallel Random Access Machine Exclusive Read Exclusive Write). The algorithm is designed in a manner that keeps the tradeoff between the number of processor elements in the architecture and execution time. The simulation of the algorithm proves the theoretical analysis of the algorithm. The results of this research can be utilized in developing faster embedded systems. Sorting on Centralized Diamond (SOCD) algorithm is issued on the novel Centralized Diamond architecture which takes the advantages of Single Instruction Multiple Data (SIMD) architecture. This architecture and the sort on it are intuitive and optimal.展开更多
A Taylor series expansion(TSE) based design for minimum mean-square error(MMSE) and QR decomposition(QRD) of multi-input and multi-output(MIMO) systems is proposed based on application specific instruction set process...A Taylor series expansion(TSE) based design for minimum mean-square error(MMSE) and QR decomposition(QRD) of multi-input and multi-output(MIMO) systems is proposed based on application specific instruction set processor(ASIP), which uses TSE algorithm instead of resource-consuming reciprocal and reciprocal square root(RSR) operations.The aim is to give a high performance implementation for MMSE and QRD in one programmable platform simultaneously.Furthermore, instruction set architecture(ISA) and the allocation of data paths in single instruction multiple data-very long instruction word(SIMD-VLIW) architecture are provided, offering more data parallelism and instruction parallelism for different dimension matrices and operation types.Meanwhile, multiple level numerical precision can be achieved with flexible table size and expansion order in TSE ISA.The ASIP has been implemented to a 28 nm CMOS process and frequency reaches 800 MHz.Experimental results show that the proposed design provides perfect numerical precision within the fixed bit-width of the ASIP, higher matrix processing rate better than the requirements of 5G system and more rate-area efficiency comparable with ASIC implementations.展开更多
Through the comparison of various acquisition technology and related technology theory of the existing scheme, the paper analyze and design the power battery testing platform of the data acquisition system, and give t...Through the comparison of various acquisition technology and related technology theory of the existing scheme, the paper analyze and design the power battery testing platform of the data acquisition system, and give the research design scheme of the utility model through the design of the software on PC and CAN bus, which makes the full synchronization requirements acquisition unit; improve the linearity and stability of total voltage and current acquisition by the integrated circuit, and improve the system sampling rate, effectively complete the corresponding index. Finally, through experimental verification, to ensure the completion of the technical indicators.展开更多
Supplying the electronic equipment by exploiting ambient energy sources is a hot spot. In order to achieve the match between power supply and demands under the variance of environments at real time, a reconfigurable t...Supplying the electronic equipment by exploiting ambient energy sources is a hot spot. In order to achieve the match between power supply and demands under the variance of environments at real time, a reconfigurable technique is taken. In this paper, a dynamic power consumption model by using a lookup table as a unit is proposed. Then, we establish a system-level task scheduling model according to the task type. Based on single instruction multiple data (SIMD) architecture which contains a processing system and a control system with a Nios II processor, a practical dynamic reconfigurable system is built. The approach is evaluated on a hardware platform. The test results show that the system can automatically adjust the power consumption in case of external energy input changing. The utilization of the system dynamic power of their portion is from 80.05% to 91.75% during the first task assignment. During the entire processing cycle, the total energy efficiency is 97.67%.展开更多
This letter presents a programmable single-chip architecture for Multi-lnput and Multi-Output (M1MO) OFDM baseband receiver. The architecture comprises a Single Instruction Multiple Data (SIMD) DSP core and three ...This letter presents a programmable single-chip architecture for Multi-lnput and Multi-Output (M1MO) OFDM baseband receiver. The architecture comprises a Single Instruction Multiple Data (SIMD) DSP core and three coprocessors that are used for synchronization, FFT and channel decoder. In this MIMO OFDM system, the Zero Correlation Zone (ZCZ) code is used as the synchronization word preamble of packet in the physical layer in order to avoid the interference from other transmitting antennas. Furthermore, a simple channel estimation algorithm is proposed which is appropriate tbr the SIMD DSP computation.展开更多
To efficiently exploit the performance of single instruction multiple data (SIMD) architectures for video coding, a parallel memory architecture with power-of-two memory modules is proposed. It employs two novel ske...To efficiently exploit the performance of single instruction multiple data (SIMD) architectures for video coding, a parallel memory architecture with power-of-two memory modules is proposed. It employs two novel skewing schemes to provide conflict-free access to adjacent elements (8-bit and 16-bit data types) or with power-of-two intervals in both horizontal and vertical directions, which were not possible in previous parallel memory architectures. Area consumptions and delay estimations are given respectively with 4, 8 and 16 memory modules. Under a 0.18-pm CMOS technology, the synthesis results show that the proposed system can achieve 230 MHz clock frequency with 16 memory modules at the cost of 19k gates when read and write latencies are 3 and 2 clock cycles, respectively. We implement the proposed parallel memory architecture on a video signal processor (VSP). The results show that VSP enhanced with the proposed architecture achieves 1.28× speedups for H.264 real-time decoding.展开更多
Single instruction multiple data (SIMD) instructions are often implemented in modem media processors. Although SIMD instructions are useful in multimedia applications, most compilers do not have good support for SIM...Single instruction multiple data (SIMD) instructions are often implemented in modem media processors. Although SIMD instructions are useful in multimedia applications, most compilers do not have good support for SIMD instructions. This paper focuses on SIMD instructions generation for media processors. We present an efficient code optimization approach that is integrated into a retargetable C compiler. SIMD instructions are generated by finding and combining the same operations in programs. Experimental results for the UltraSPARC VIS instruction set show that a speedup factor up to 2.639 is obtained.展开更多
With the increasing demand for flexible and efficient implementation of image and video processing algorithms, there should be a good tradeoff between hardware and software design method. This paper utilized the HW-SW...With the increasing demand for flexible and efficient implementation of image and video processing algorithms, there should be a good tradeoff between hardware and software design method. This paper utilized the HW-SW codesign method to implement the H.264 decoder in an SoC with an ARM core, a multimedia processor and a deblocking filter coprocessor. For the parallel processing features of the multimedia processor, clock cycles of decoding process can be dramatically reduced. And the hardware dedicated deblocking filter coprocessor can improve the efficiency a lot. With maximum clock frequency of 150 MHz, the whole system can achieve real time processing speed and flexibility.展开更多
A tremendous amount of data has been generated by global financial markets everyday,and such time-series data needs to be analyzed in real time to explore its potential value.In recent years,we have witnessed the succ...A tremendous amount of data has been generated by global financial markets everyday,and such time-series data needs to be analyzed in real time to explore its potential value.In recent years,we have witnessed the successful adoption of machine learning models on financial data,where the importance of accuracy and timeliness demands highly effective computing frameworks.However,traditional financial time-series data processing frameworks have shown performance degradation and adaptation issues,such as the outlier handling with stock suspension in Pandas and TA-Lib.In this paper,we propose HXPY,a high-performance data processing package with a C++/Python interface for financial time-series data.HXPY supports miscellaneous acceleration techniques such as the streaming algorithm,the vectorization instruction set,and memory optimization,together with various functions such as time window functions,group operations,down-sampling operations,cross-section operations,row-wise or column-wise operations,shape transformations,and alignment functions.The results of benchmark and incremental analysis demonstrate the superior performance of HXPY compared with its counterparts.From MiBs to GiBs data,HXPY significantly outperforms other in-memory dataframe computing rivals even up to hundreds of times.展开更多
Computer vision(CV)algorithms have been extensively used for a myriad of applications nowadays.As the multimedia data are generally well-formatted and regular,it is beneficial to leverage the massive parallel processi...Computer vision(CV)algorithms have been extensively used for a myriad of applications nowadays.As the multimedia data are generally well-formatted and regular,it is beneficial to leverage the massive parallel processing power of the underlying platform to improve the performances of CV algorithms.Single Instruction Multiple Data(SIMD)instructions,capable of conducting the same operation on multiple data items in a single instruction,are extensively employed to improve the efficiency of CV algorithms.In this paper,we evaluate the power and effectiveness of RISC-V vector extension(RV-V)on typical CV algorithms,such as Gray Scale,Mean Filter,and Edge Detection.By our examinations,we show that compared with the baseline OpenCV implementation using scalar instructions,the equivalent implementations using the RV-V(version 0.8)can reduce the instruction count of the same CV algorithm up to 24x,when processing the same input images.Whereas,the actual performances improvement measured by the cycle counts is highly related with the specific implementation of the underlying RV-V co-processor.In our evaluation,by using the vector co-processor(with eight execution lanes)of Xuantie C906,vector-version CV algorithms averagely exhibit up to 2.98x performances speedups compared with their scalar counterparts.展开更多
We present novel vector permutation and branch reduction methods to minimize the number of execution cycles for bit reversal algorithms.The new methods are applied to single instruction multiple data(SIMD) parallel im...We present novel vector permutation and branch reduction methods to minimize the number of execution cycles for bit reversal algorithms.The new methods are applied to single instruction multiple data(SIMD) parallel implementation of complex data floating-point fast Fourier transform(FFT).The number of operational clock cycles can be reduced by an average factor of 3.5 by using our vector permutation methods and by 1.1 by using our branch reduction methods,compared with conventional im-plementations.Experiments on MPC7448(a well-known SIMD reduced instruction set computing processor) demonstrate that our optimal bit-reversal algorithm consistently takes fewer than two cycles per element in complex array operations.展开更多
Graphics processing units(GPUs)employ the single instruction multiple data(SIMD)hardware to run threads in parallel and allow each thread to maintain an arbitrary control flow.Threads running concurrently within a war...Graphics processing units(GPUs)employ the single instruction multiple data(SIMD)hardware to run threads in parallel and allow each thread to maintain an arbitrary control flow.Threads running concurrently within a warp may jump to different paths after conditional branches.Such divergent control flow makes some lanes idle and hence reduces the SIMD utilization of GPUs.To alleviate the waste of SIMD lanes,threads from multiple warps can be collected together to improve the SIMD lane utilization by compacting threads into idle lanes.However,this mechanism induces extra barrier synchronizations since warps have to be stalled to wait for other warps for compactions,resulting in that no warps are scheduled in some cases.In this paper,we propose an approach to reduce the overhead of barrier synchronizat ions induced by compactions,In our approach,a compaction is bypassed by warps whose threads all jump to the same path after branches.Moreover,warps waiting for a compaction can also bypass this compaction when no warps are ready for issuing.In addition,a compaction is canceled if idle lanes can not be reduced via this compaction.The experimental results demonstrate that our approach provides an average improvement of 21%over the baseline GPU for applications with massive divergent branches,while recovering the performance loss induced by compactions by 13%on average for applications with many non-divergent control flows.展开更多
Usually a single MPEG2 video encoder chip realizes the multiple post stage tasks of motion estimation, such as motion vector refinement and prediction error generation, using multiple hardware modules. This paper p...Usually a single MPEG2 video encoder chip realizes the multiple post stage tasks of motion estimation, such as motion vector refinement and prediction error generation, using multiple hardware modules. This paper proposes a new architecture using only a single module to implement the post stage tasks of motion estimation, which has a single instruction stream over multiple data streams (SIMD). The new architecture is simple and more regular; capable of providing sufficient computational power and of adapting to the encoding flexibility required by the MPEG2 standard. Therefore, it is a more suitable architecture for the system on a chip. NEL Corporation (NTT Electronics, Japan) has integrated a circuit based on this architecture into the single MPEG2 MP@ML encoder chip, which uses the multiresolution telescopic search motion estimation algorithm. Using 0.25 μm CMOS, four metal layer technology, this circuit has 15.4 M gates with an area of about 29 mm 2. The operating clock frequency is 81 MHz.展开更多
基金supported by the National Natural Science Foundation of China under Grant No.U19B2021the Key Research and Development Program of Shaanxi under Grant No.2020ZDLGY08-04+1 种基金the Key Technologies R&D Program of He’nan Province under Grant No.212102210084the Innovation Scientists and Technicians Troop Construction Projects of Henan Province.
文摘As an emerging joint learning model,federated learning is a promising way to combine model parameters of different users for training and inference without collecting users’original data.However,a practical and efficient solution has not been established in previous work due to the absence of efficient matrix computation and cryptography schemes in the privacy-preserving federated learning model,especially in partially homomorphic cryptosystems.In this paper,we propose a Practical and Efficient Privacy-preserving Federated Learning(PEPFL)framework.First,we present a lifted distributed ElGamal cryptosystem for federated learning,which can solve the multi-key problem in federated learning.Secondly,we develop a Practical Partially Single Instruction Multiple Data(PSIMD)parallelism scheme that can encode a plaintext matrix into single plaintext for encryption,improving the encryption efficiency and reducing the communication cost in partially homomorphic cryptosystem.In addition,based on the Convolutional Neural Network(CNN)and the designed cryptosystem,a novel privacy-preserving federated learning framework is designed by using Momentum Gradient Descent(MGD).Finally,we evaluate the security and performance of PEPFL.The experiment results demonstrate that the scheme is practicable,effective,and secure with low communication and computation costs.
文摘An effective method is introduced to compensate the effects of mutual coupling for the Estimation of Signal Parameter via Rotational Invariance Techniques (ESPRIT) direction finding algorithm in application of signal snapshot array processing.Changing the covariance matrix into a Teoplitz matrix can achieve high resolution in the Direction Of Arrive (DOA) estimation.How the mutual coupling affects the array antennas has been discussed and a new definition of mutual im- pedance has been used to characterize the mutual coupling effects between the array elements.Based on the new mutual impedance matrix,a practical method is presented to eliminate the effects of mutual coupling for ESPRIT in the single snapshot data processing.The simulation results show that, this new method not only properly reduces the effects of mutual coupling,but also maintains its steady performance even for weak signals.
文摘A single column model (SCM) is constructed by extracting the physical subroutines from the NCAR Community Climate Model version 1 (CCM1).Simulated data are generated by CCM1 and used to validate the SCM and to study the sensitivity of the SCM to errors in its input data.It is found that the SCM temperature predictions are moderately sensitive to errors in the input horizontal temperature flux convergence and moisture flux convergence.Two types of error are concerned in this study,random errors due to insufficient data resolution,and errors due to insufficient data area coverage.While the first type of error can be reduced by filtering and/or increasing the data resolution,it is shown that the second type of error can be reduced by enlarging the data area coverage and using a suitable method to compute the input flux convergence terms.
文摘Several parallel sorting techniques on different architectures have been studied for many years. Due to the need for faster systems in today's world, parallelism can be used to accelerate applications. Nowadays, parallel operations are used to solve computer problems such as sort and search, which result in a reasonable speed. Sorting is one of the most important operations in computing world. The authors always try to find the best in different areas which the premier is speedup. In this paper, the authors issued a sort with O(logn) time complexity on PRAM EREW (Parallel Random Access Machine Exclusive Read Exclusive Write). The algorithm is designed in a manner that keeps the tradeoff between the number of processor elements in the architecture and execution time. The simulation of the algorithm proves the theoretical analysis of the algorithm. The results of this research can be utilized in developing faster embedded systems. Sorting on Centralized Diamond (SOCD) algorithm is issued on the novel Centralized Diamond architecture which takes the advantages of Single Instruction Multiple Data (SIMD) architecture. This architecture and the sort on it are intuitive and optimal.
基金Supported by the Industrial Internet Innovation and Development Project of Ministry of Industry and Information Technology (No.GHBJ2004)。
文摘A Taylor series expansion(TSE) based design for minimum mean-square error(MMSE) and QR decomposition(QRD) of multi-input and multi-output(MIMO) systems is proposed based on application specific instruction set processor(ASIP), which uses TSE algorithm instead of resource-consuming reciprocal and reciprocal square root(RSR) operations.The aim is to give a high performance implementation for MMSE and QRD in one programmable platform simultaneously.Furthermore, instruction set architecture(ISA) and the allocation of data paths in single instruction multiple data-very long instruction word(SIMD-VLIW) architecture are provided, offering more data parallelism and instruction parallelism for different dimension matrices and operation types.Meanwhile, multiple level numerical precision can be achieved with flexible table size and expansion order in TSE ISA.The ASIP has been implemented to a 28 nm CMOS process and frequency reaches 800 MHz.Experimental results show that the proposed design provides perfect numerical precision within the fixed bit-width of the ASIP, higher matrix processing rate better than the requirements of 5G system and more rate-area efficiency comparable with ASIC implementations.
文摘Through the comparison of various acquisition technology and related technology theory of the existing scheme, the paper analyze and design the power battery testing platform of the data acquisition system, and give the research design scheme of the utility model through the design of the software on PC and CAN bus, which makes the full synchronization requirements acquisition unit; improve the linearity and stability of total voltage and current acquisition by the integrated circuit, and improve the system sampling rate, effectively complete the corresponding index. Finally, through experimental verification, to ensure the completion of the technical indicators.
基金supported by the National Natural Science Foundation of China under Grant No. 61176025 and No. 61006027the Fundamental Research Funds for the Central Universities under Grant No.ZYGX2012J003+1 种基金National Laboratory of Analogue Integrated Circuit Grants under Grants No. 9140C0901101002 and No. 9140C0901101003New Century Excellent Talents Program under Grant No.NCET-10-0297
文摘Supplying the electronic equipment by exploiting ambient energy sources is a hot spot. In order to achieve the match between power supply and demands under the variance of environments at real time, a reconfigurable technique is taken. In this paper, a dynamic power consumption model by using a lookup table as a unit is proposed. Then, we establish a system-level task scheduling model according to the task type. Based on single instruction multiple data (SIMD) architecture which contains a processing system and a control system with a Nios II processor, a practical dynamic reconfigurable system is built. The approach is evaluated on a hardware platform. The test results show that the system can automatically adjust the power consumption in case of external energy input changing. The utilization of the system dynamic power of their portion is from 80.05% to 91.75% during the first task assignment. During the entire processing cycle, the total energy efficiency is 97.67%.
基金Supported by the National Natural Science Foundation of China (No.60476013).
文摘This letter presents a programmable single-chip architecture for Multi-lnput and Multi-Output (M1MO) OFDM baseband receiver. The architecture comprises a Single Instruction Multiple Data (SIMD) DSP core and three coprocessors that are used for synchronization, FFT and channel decoder. In this MIMO OFDM system, the Zero Correlation Zone (ZCZ) code is used as the synchronization word preamble of packet in the physical layer in order to avoid the interference from other transmitting antennas. Furthermore, a simple channel estimation algorithm is proposed which is appropriate tbr the SIMD DSP computation.
基金Project (No. 2005AA1Z1271) supported by the Hi-Tech Research and Development Program (863) of China
文摘To efficiently exploit the performance of single instruction multiple data (SIMD) architectures for video coding, a parallel memory architecture with power-of-two memory modules is proposed. It employs two novel skewing schemes to provide conflict-free access to adjacent elements (8-bit and 16-bit data types) or with power-of-two intervals in both horizontal and vertical directions, which were not possible in previous parallel memory architectures. Area consumptions and delay estimations are given respectively with 4, 8 and 16 memory modules. Under a 0.18-pm CMOS technology, the synthesis results show that the proposed system can achieve 230 MHz clock frequency with 16 memory modules at the cost of 19k gates when read and write latencies are 3 and 2 clock cycles, respectively. We implement the proposed parallel memory architecture on a video signal processor (VSP). The results show that VSP enhanced with the proposed architecture achieves 1.28× speedups for H.264 real-time decoding.
文摘Single instruction multiple data (SIMD) instructions are often implemented in modem media processors. Although SIMD instructions are useful in multimedia applications, most compilers do not have good support for SIMD instructions. This paper focuses on SIMD instructions generation for media processors. We present an efficient code optimization approach that is integrated into a retargetable C compiler. SIMD instructions are generated by finding and combining the same operations in programs. Experimental results for the UltraSPARC VIS instruction set show that a speedup factor up to 2.639 is obtained.
文摘With the increasing demand for flexible and efficient implementation of image and video processing algorithms, there should be a good tradeoff between hardware and software design method. This paper utilized the HW-SW codesign method to implement the H.264 decoder in an SoC with an ARM core, a multimedia processor and a deblocking filter coprocessor. For the parallel processing features of the multimedia processor, clock cycles of decoding process can be dramatically reduced. And the hardware dedicated deblocking filter coprocessor can improve the efficiency a lot. With maximum clock frequency of 150 MHz, the whole system can achieve real time processing speed and flexibility.
文摘A tremendous amount of data has been generated by global financial markets everyday,and such time-series data needs to be analyzed in real time to explore its potential value.In recent years,we have witnessed the successful adoption of machine learning models on financial data,where the importance of accuracy and timeliness demands highly effective computing frameworks.However,traditional financial time-series data processing frameworks have shown performance degradation and adaptation issues,such as the outlier handling with stock suspension in Pandas and TA-Lib.In this paper,we propose HXPY,a high-performance data processing package with a C++/Python interface for financial time-series data.HXPY supports miscellaneous acceleration techniques such as the streaming algorithm,the vectorization instruction set,and memory optimization,together with various functions such as time window functions,group operations,down-sampling operations,cross-section operations,row-wise or column-wise operations,shape transformations,and alignment functions.The results of benchmark and incremental analysis demonstrate the superior performance of HXPY compared with its counterparts.From MiBs to GiBs data,HXPY significantly outperforms other in-memory dataframe computing rivals even up to hundreds of times.
基金supported by the National Natural Science Foundation of China under Grant No.61972444。
文摘Computer vision(CV)algorithms have been extensively used for a myriad of applications nowadays.As the multimedia data are generally well-formatted and regular,it is beneficial to leverage the massive parallel processing power of the underlying platform to improve the performances of CV algorithms.Single Instruction Multiple Data(SIMD)instructions,capable of conducting the same operation on multiple data items in a single instruction,are extensively employed to improve the efficiency of CV algorithms.In this paper,we evaluate the power and effectiveness of RISC-V vector extension(RV-V)on typical CV algorithms,such as Gray Scale,Mean Filter,and Edge Detection.By our examinations,we show that compared with the baseline OpenCV implementation using scalar instructions,the equivalent implementations using the RV-V(version 0.8)can reduce the instruction count of the same CV algorithm up to 24x,when processing the same input images.Whereas,the actual performances improvement measured by the cycle counts is highly related with the specific implementation of the underlying RV-V co-processor.In our evaluation,by using the vector co-processor(with eight execution lanes)of Xuantie C906,vector-version CV algorithms averagely exhibit up to 2.98x performances speedups compared with their scalar counterparts.
文摘We present novel vector permutation and branch reduction methods to minimize the number of execution cycles for bit reversal algorithms.The new methods are applied to single instruction multiple data(SIMD) parallel implementation of complex data floating-point fast Fourier transform(FFT).The number of operational clock cycles can be reduced by an average factor of 3.5 by using our vector permutation methods and by 1.1 by using our branch reduction methods,compared with conventional im-plementations.Experiments on MPC7448(a well-known SIMD reduced instruction set computing processor) demonstrate that our optimal bit-reversal algorithm consistently takes fewer than two cycles per element in complex array operations.
基金the National Natural Science Foundation of China(No.61702521)the Natural Science Foundation of Tianjin(No.18JCQNJC00400)+1 种基金the Scientific Research Foundation of Civil Aviation University of China(No.2017QD12S)the Fundamental Research Funds for the Central Universities of Civil Aviation University of China(Nos.3122018C023 and 3122018C021)。
文摘Graphics processing units(GPUs)employ the single instruction multiple data(SIMD)hardware to run threads in parallel and allow each thread to maintain an arbitrary control flow.Threads running concurrently within a warp may jump to different paths after conditional branches.Such divergent control flow makes some lanes idle and hence reduces the SIMD utilization of GPUs.To alleviate the waste of SIMD lanes,threads from multiple warps can be collected together to improve the SIMD lane utilization by compacting threads into idle lanes.However,this mechanism induces extra barrier synchronizations since warps have to be stalled to wait for other warps for compactions,resulting in that no warps are scheduled in some cases.In this paper,we propose an approach to reduce the overhead of barrier synchronizat ions induced by compactions,In our approach,a compaction is bypassed by warps whose threads all jump to the same path after branches.Moreover,warps waiting for a compaction can also bypass this compaction when no warps are ready for issuing.In addition,a compaction is canceled if idle lanes can not be reduced via this compaction.The experimental results demonstrate that our approach provides an average improvement of 21%over the baseline GPU for applications with massive divergent branches,while recovering the performance loss induced by compactions by 13%on average for applications with many non-divergent control flows.
文摘Usually a single MPEG2 video encoder chip realizes the multiple post stage tasks of motion estimation, such as motion vector refinement and prediction error generation, using multiple hardware modules. This paper proposes a new architecture using only a single module to implement the post stage tasks of motion estimation, which has a single instruction stream over multiple data streams (SIMD). The new architecture is simple and more regular; capable of providing sufficient computational power and of adapting to the encoding flexibility required by the MPEG2 standard. Therefore, it is a more suitable architecture for the system on a chip. NEL Corporation (NTT Electronics, Japan) has integrated a circuit based on this architecture into the single MPEG2 MP@ML encoder chip, which uses the multiresolution telescopic search motion estimation algorithm. Using 0.25 μm CMOS, four metal layer technology, this circuit has 15.4 M gates with an area of about 29 mm 2. The operating clock frequency is 81 MHz.