Toeplitz matrix-vector multiplication is widely used in various fields,including optimal control,systolic finite field multipliers,multidimensional convolution,etc.In this paper,we first present a non-asymptotic quant...Toeplitz matrix-vector multiplication is widely used in various fields,including optimal control,systolic finite field multipliers,multidimensional convolution,etc.In this paper,we first present a non-asymptotic quantum algorithm for Toeplitz matrix-vector multiplication with time complexity O(κpolylogn),whereκand 2n are the condition number and the dimension of the circulant matrix extended from the Toeplitz matrix,respectively.For the case with an unknown generating function,we also give a corresponding non-asymptotic quantum version that eliminates the dependency on the L_(1)-normρof the displacement of the structured matrices.Due to the good use of the special properties of Toeplitz matrices,the proposed quantum algorithms are sufficiently accurate and efficient compared to the existing quantum algorithms under certain circumstances.展开更多
As one of the most essential and important operations in linear algebra, the performance prediction of sparse matrix-vector multiplication (SpMV) on GPUs has got more and more attention in recent years. In 2012, Guo a...As one of the most essential and important operations in linear algebra, the performance prediction of sparse matrix-vector multiplication (SpMV) on GPUs has got more and more attention in recent years. In 2012, Guo and Wang put forward a new idea to predict the performance of SpMV on GPUs. However, they didn’t consider the matrix structure completely, so the execution time predicted by their model tends to be inaccurate for general sparse matrix. To address this problem, we proposed two new similar models, which take into account the structure of the matrices and make the performance prediction model more accurate. In addition, we predict the execution time of SpMV for CSR-V, CSR-S, ELL and JAD sparse matrix storage formats by the new models on the CUDA platform. Our experimental results show that the accuracy of prediction by our models is 1.69 times better than Guo and Wang’s model on average for most general matrices.展开更多
With the continuous development of deep learning,Deep Convolutional Neural Network(DCNN)has attracted wide attention in the industry due to its high accuracy in image classification.Compared with other DCNN hard-ware ...With the continuous development of deep learning,Deep Convolutional Neural Network(DCNN)has attracted wide attention in the industry due to its high accuracy in image classification.Compared with other DCNN hard-ware deployment platforms,Field Programmable Gate Array(FPGA)has the advantages of being programmable,low power consumption,parallelism,and low cost.However,the enormous amount of calculation of DCNN and the limited logic capacity of FPGA restrict the energy efficiency of the DCNN accelerator.The traditional sequential sliding window method can improve the throughput of the DCNN accelerator by data multiplexing,but this method’s data multiplexing rate is low because it repeatedly reads the data between rows.This paper proposes a fast data readout strategy via the circular sliding window data reading method,it can improve the multiplexing rate of data between rows by optimizing the memory access order of input data.In addition,the multiplication bit width of the DCNN accelerator is much smaller than that of the Digital Signal Processing(DSP)on the FPGA,which means that there will be a waste of resources if a multiplication uses a single DSP.A multiplier sharing strategy is proposed,the multiplier of the accelerator is customized so that a single DSP block can complete multiple groups of 4,6,and 8-bit signed multiplication in parallel.Finally,based on two strategies of appeal,an FPGA optimized accelerator is proposed.The accelerator is customized by Verilog language and deployed on Xilinx VCU118.When the accelerator recognizes the CIRFAR-10 dataset,its energy efficiency is 39.98 GOPS/W,which provides 1.73×speedup energy efficiency over previous DCNN FPGA accelerators.When the accelerator recognizes the IMAGENET dataset,its energy efficiency is 41.12 GOPS/W,which shows 1.28×−3.14×energy efficiency compared with others.展开更多
One of the elementary operations in computing systems is multiplication.Therefore,high-speed and low-power multipliers design is mandatory for efficient computing systems.In designing low-energy dissipation circuits,r...One of the elementary operations in computing systems is multiplication.Therefore,high-speed and low-power multipliers design is mandatory for efficient computing systems.In designing low-energy dissipation circuits,reversible logic is more efficient than irreversible logic circuits but at the cost of higher complexity.This paper introduces an efficient signed/unsigned 4×4 reversible Vedic multiplier with minimum quantum cost.The Vedic multiplier is considered fast as it generates all partial product and their sum in one step.This paper proposes two reversible Vedic multipliers with optimized quantum cost and garbage output.First,the unsigned Vedic multiplier is designed based on the Urdhava Tiryakbhyam(UT)Sutra.This multiplier consists of bitwise multiplication and adder compressors.Compared with Vedic multipliers in the literature,the proposed design has a quantum cost of 111 with a reduction of 94%compared to the previous design.It has a garbage output of 30 with optimization of the best-compared design.Second,the proposed unsigned multiplier is expanded to allow the multiplication of signed numbers as well as unsigned numbers.Two signed Vedic multipliers are presented with the aim of obtaining more optimization in performance parameters.DesignI has separate binary two’s complement(B2C)and MUX circuits,while DesignII combines binary two’s complement and MUX circuits in one circuit.DesignI shows the lowest quantum cost,231,regarding state-ofthe-art.DesignII has a quantum cost of 199,reducing to 86.14%of DesignI.The functionality of the proposed multiplier is simulated and verified using XILINX ISE 14.2.展开更多
Approximate computing is a popularfield for low power consumption that is used in several applications like image processing,video processing,multi-media and data mining.This Approximate computing is majorly performed ...Approximate computing is a popularfield for low power consumption that is used in several applications like image processing,video processing,multi-media and data mining.This Approximate computing is majorly performed with an arithmetic circuit particular with a multiplier.The multiplier is the most essen-tial element used for approximate computing where the power consumption is majorly based on its performance.There are several researchers are worked on the approximate multiplier for power reduction for a few decades,but the design of low power approximate multiplier is not so easy.This seems a bigger challenge for digital industries to design an approximate multiplier with low power and minimum error rate with higher accuracy.To overcome these issues,the digital circuits are applied to the Deep Learning(DL)approaches for higher accuracy.In recent times,DL is the method that is used for higher learning and prediction accuracy in severalfields.Therefore,the Long Short-Term Memory(LSTM)is a popular time series DL method is used in this work for approximate computing.To provide an optimal solution,the LSTM is combined with a meta-heuristics Jel-lyfish search optimisation technique to design an input aware deep learning-based approximate multiplier(DLAM).In this work,the jelly optimised LSTM model is used to enhance the error metrics performance of the Approximate multiplier.The optimal hyperparameters of the LSTM model are identified by jelly search opti-misation.Thisfine-tuning is used to obtain an optimal solution to perform an LSTM with higher accuracy.The proposed pre-trained LSTM model is used to generate approximate design libraries for the different truncation levels as a func-tion of area,delay,power and error metrics.The experimental results on an 8-bit multiplier with an image processing application shows that the proposed approx-imate computing multiplier achieved a superior area and power reduction with very good results on error rates.展开更多
基金supported by the National Natural Science Foundation of China(Grant Nos.62071015 and 62171264)。
文摘Toeplitz matrix-vector multiplication is widely used in various fields,including optimal control,systolic finite field multipliers,multidimensional convolution,etc.In this paper,we first present a non-asymptotic quantum algorithm for Toeplitz matrix-vector multiplication with time complexity O(κpolylogn),whereκand 2n are the condition number and the dimension of the circulant matrix extended from the Toeplitz matrix,respectively.For the case with an unknown generating function,we also give a corresponding non-asymptotic quantum version that eliminates the dependency on the L_(1)-normρof the displacement of the structured matrices.Due to the good use of the special properties of Toeplitz matrices,the proposed quantum algorithms are sufficiently accurate and efficient compared to the existing quantum algorithms under certain circumstances.
文摘As one of the most essential and important operations in linear algebra, the performance prediction of sparse matrix-vector multiplication (SpMV) on GPUs has got more and more attention in recent years. In 2012, Guo and Wang put forward a new idea to predict the performance of SpMV on GPUs. However, they didn’t consider the matrix structure completely, so the execution time predicted by their model tends to be inaccurate for general sparse matrix. To address this problem, we proposed two new similar models, which take into account the structure of the matrices and make the performance prediction model more accurate. In addition, we predict the execution time of SpMV for CSR-V, CSR-S, ELL and JAD sparse matrix storage formats by the new models on the CUDA platform. Our experimental results show that the accuracy of prediction by our models is 1.69 times better than Guo and Wang’s model on average for most general matrices.
基金supported in part by the Major Program of the Ministry of Science and Technology of China under Grant 2019YFB2205102in part by the National Natural Science Foundation of China under Grant 61974164,62074166,61804181,62004219,62004220,62104256.
文摘With the continuous development of deep learning,Deep Convolutional Neural Network(DCNN)has attracted wide attention in the industry due to its high accuracy in image classification.Compared with other DCNN hard-ware deployment platforms,Field Programmable Gate Array(FPGA)has the advantages of being programmable,low power consumption,parallelism,and low cost.However,the enormous amount of calculation of DCNN and the limited logic capacity of FPGA restrict the energy efficiency of the DCNN accelerator.The traditional sequential sliding window method can improve the throughput of the DCNN accelerator by data multiplexing,but this method’s data multiplexing rate is low because it repeatedly reads the data between rows.This paper proposes a fast data readout strategy via the circular sliding window data reading method,it can improve the multiplexing rate of data between rows by optimizing the memory access order of input data.In addition,the multiplication bit width of the DCNN accelerator is much smaller than that of the Digital Signal Processing(DSP)on the FPGA,which means that there will be a waste of resources if a multiplication uses a single DSP.A multiplier sharing strategy is proposed,the multiplier of the accelerator is customized so that a single DSP block can complete multiple groups of 4,6,and 8-bit signed multiplication in parallel.Finally,based on two strategies of appeal,an FPGA optimized accelerator is proposed.The accelerator is customized by Verilog language and deployed on Xilinx VCU118.When the accelerator recognizes the CIRFAR-10 dataset,its energy efficiency is 39.98 GOPS/W,which provides 1.73×speedup energy efficiency over previous DCNN FPGA accelerators.When the accelerator recognizes the IMAGENET dataset,its energy efficiency is 41.12 GOPS/W,which shows 1.28×−3.14×energy efficiency compared with others.
文摘One of the elementary operations in computing systems is multiplication.Therefore,high-speed and low-power multipliers design is mandatory for efficient computing systems.In designing low-energy dissipation circuits,reversible logic is more efficient than irreversible logic circuits but at the cost of higher complexity.This paper introduces an efficient signed/unsigned 4×4 reversible Vedic multiplier with minimum quantum cost.The Vedic multiplier is considered fast as it generates all partial product and their sum in one step.This paper proposes two reversible Vedic multipliers with optimized quantum cost and garbage output.First,the unsigned Vedic multiplier is designed based on the Urdhava Tiryakbhyam(UT)Sutra.This multiplier consists of bitwise multiplication and adder compressors.Compared with Vedic multipliers in the literature,the proposed design has a quantum cost of 111 with a reduction of 94%compared to the previous design.It has a garbage output of 30 with optimization of the best-compared design.Second,the proposed unsigned multiplier is expanded to allow the multiplication of signed numbers as well as unsigned numbers.Two signed Vedic multipliers are presented with the aim of obtaining more optimization in performance parameters.DesignI has separate binary two’s complement(B2C)and MUX circuits,while DesignII combines binary two’s complement and MUX circuits in one circuit.DesignI shows the lowest quantum cost,231,regarding state-ofthe-art.DesignII has a quantum cost of 199,reducing to 86.14%of DesignI.The functionality of the proposed multiplier is simulated and verified using XILINX ISE 14.2.
文摘Approximate computing is a popularfield for low power consumption that is used in several applications like image processing,video processing,multi-media and data mining.This Approximate computing is majorly performed with an arithmetic circuit particular with a multiplier.The multiplier is the most essen-tial element used for approximate computing where the power consumption is majorly based on its performance.There are several researchers are worked on the approximate multiplier for power reduction for a few decades,but the design of low power approximate multiplier is not so easy.This seems a bigger challenge for digital industries to design an approximate multiplier with low power and minimum error rate with higher accuracy.To overcome these issues,the digital circuits are applied to the Deep Learning(DL)approaches for higher accuracy.In recent times,DL is the method that is used for higher learning and prediction accuracy in severalfields.Therefore,the Long Short-Term Memory(LSTM)is a popular time series DL method is used in this work for approximate computing.To provide an optimal solution,the LSTM is combined with a meta-heuristics Jel-lyfish search optimisation technique to design an input aware deep learning-based approximate multiplier(DLAM).In this work,the jelly optimised LSTM model is used to enhance the error metrics performance of the Approximate multiplier.The optimal hyperparameters of the LSTM model are identified by jelly search opti-misation.Thisfine-tuning is used to obtain an optimal solution to perform an LSTM with higher accuracy.The proposed pre-trained LSTM model is used to generate approximate design libraries for the different truncation levels as a func-tion of area,delay,power and error metrics.The experimental results on an 8-bit multiplier with an image processing application shows that the proposed approx-imate computing multiplier achieved a superior area and power reduction with very good results on error rates.
文摘随着高分辨率对地观测要求的不断提高,合成孔径雷达(Synthetic Aperture Radar,SAR)的应用将越来越广泛。针对高分辨率SAR成像存在数据量大、存储难度高、计算时间长等问题,目前常用的解决方法是在SAR成像模型中引入压缩感知(Compressed Sensing,CS)的方法降低采样率和数据量。通常使用单一的正则化作为约束条件,可以抑制点目标旁瓣,实现点目标特征增强,但是观测场景中可能存在多种目标类型,因此使用单一正则化约束难以满足多种特征增强的要求。本文提出了一种基于复合正则化的稀疏高分辨SAR成像方法,通过压缩感知降低数据量,并使用多种正则化的线性组合作为约束条件,增强观测场景中不同类型目标的特征,实现复杂场景中高分辨率对地观测的要求。该方法在稀疏SAR成像模型中引入非凸正则化和全变分(Total Variation,TV)正则化作为约束条件,减小稀疏重构误差、增强区域目标的特征,降低噪声对成像结果的影响,提高成像质量;采用改进的交替方向乘子法(Alternating Direction Method of Multipliers,ADMM)实现复合正则化约束的求解,减少计算时间、快速重构图像;使用方位距离解耦算子代替观测矩阵及其共轭转置,进一步降低计算复杂度。仿真和实测数据实验表明,本文所提算法可以对点目标和区域目标进行特征增强,减小计算复杂度,提高收敛性能,实现快速高分辨的图像重构。