Funding: Supported in part by the National Natural Science Foundation of China (No. 61271269), the National High-Tech Research and Development (863) Program (No. 2013AA01320), and the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (No. YETP0102).
Abstract: As performance requirements for bus-based embedded Systems-on-Chip (SoCs) increase, more and more on-chip application-specific hardware accelerators (e.g., filters, FFTs, JPEG encoders, GSMs, and AES encoders) are being integrated into their designs. These accelerators require system-level tradeoffs among performance, area, and scalability. Accelerator parallelization and Point-to-Point (P2P) interconnect insertion are two effective system-level adjustments. The former boosts computing performance at the cost of area, while the latter provides higher bandwidth at the cost of routability. Moreover, the two interact with each other. This paper proposes a design flow to optimize accelerator parallelization and P2P interconnect insertion simultaneously. To explore the huge optimization space, we develop an effective algorithm whose goal is to reduce total SoC latency under constraints on SoC area and total P2P wire length. Experimental results show that the performance difference between our proposed algorithm and the optimal results is only 2.33% on average, while the running time of the algorithm is less than 17 s.
Funding: Supported by the National Natural Science Foundation of China (61834005, 61772417, 61602377, 61802304, 61874087), the Shaanxi International Science and Technology Cooperation Program (2018KW-006), the Shaanxi Province Co-ordination Innovation Project of Science and Technology (2016KTZDGY02-04-02), and the Shaanxi Provincial Key R&D Plan (2017GY-060).
Abstract: Primitive assembly is an indispensable stage of graphics rendering that prepares objects for the subsequent pipeline steps. Conventional approaches, however, suffer from several issues, such as missing surface attributes, color-mode mismatch for clipped primitives, and a performance bottleneck in the rendering pipeline. This paper takes all of these issues into consideration and proposes a parallel primitive assembly accelerator (PPAA) that both solves the functional problems and improves shading performance. The register transfer level (RTL) circuit is designed and the detailed approach is presented. Prototype systems are implemented on the Xilinx field programmable gate array (FPGA) XC6VLX550T and the Altera FPGA EP2C70F896C6. The experimental results show that PPAA accomplishes the assembly tasks correctly and achieves 1.5x and 2.5x the performance of two previous implementations. For the most frequently used independent primitives, PPAA efficiently enhances throughput by squeezing out pipeline bubbles and balancing the pipeline stages.
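To make the role of primitive assembly concrete, the following is a minimal software sketch of grouping a vertex index stream into triangles. It is not the PPAA RTL design described in the abstract; the function name `assemble_triangles` and the mode strings are hypothetical, and the strip and fan orderings follow the usual OpenGL-style conventions as an assumption.

```python
def assemble_triangles(indices, mode):
    """Group a flat list of vertex indices into triangles.

    mode (hypothetical names):
      "triangles" - independent triangles, three indices each
      "strip"     - each new index forms a triangle with the previous two
      "fan"       - every triangle shares the first index
    Incomplete trailing primitives are dropped.
    """
    n = len(indices)
    if mode == "triangles":
        return [tuple(indices[i:i + 3]) for i in range(0, n - 2, 3)]
    if mode == "strip":
        tris = []
        for i in range(n - 2):
            a, b, c = indices[i], indices[i + 1], indices[i + 2]
            # Swap the first two vertices on odd triangles so that
            # every triangle keeps the same winding order.
            tris.append((a, b, c) if i % 2 == 0 else (b, a, c))
        return tris
    if mode == "fan":
        return [(indices[0], indices[i], indices[i + 1]) for i in range(1, n - 1)]
    raise ValueError(f"unknown mode: {mode!r}")
```

Independent triangles need no state carried between primitives, which is one plausible reason they are the easiest case to keep a hardware pipeline full for.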
Funding: Supported by the National Natural Science Foundation of China (Nos. 51908492, 52008366, and 52238001) and the Zhejiang Provincial Natural Science Foundation of China (Nos. LY21E080022 and LQ21E080019).
Abstract: A graphics processing unit (GPU)-accelerated vector-form particle-element method, i.e., the finite particle method (FPM), is proposed for 3D elastoplastic contact of structures involving strong nonlinearities and computationally expensive contact calculations. A hexahedral FPM element with reduced integration and anti-hourglass control is developed to model structural elastoplastic behavior. The 3D space containing the contact surfaces is decomposed into cubic cells, and the contact search is performed between adjacent cells to improve search efficiency. A connected list data structure is used to store contact particles and facilitate the parallel contact search procedure. The contact constraints are enforced by explicitly applying normal and tangential contact forces to the contact particles. The proposed method is fully accelerated by GPU-based parallel computing. After verification, the performance of the proposed method is compared with the serial finite element code Abaqus/Explicit on two large-scale contact examples. The maximum speedup of the proposed method over Abaqus/Explicit is approximately 80 for the overall computation and 340 for the contact calculations. The proposed method is therefore shown to be effective and efficient.
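The cell decomposition and list-based storage described above can be illustrated with a small serial sketch. This is not the authors' GPU implementation: the function names, the `head`/`nxt` chain layout, and the choice of a search radius equal to the cell size are all assumptions made for illustration.

```python
from itertools import product


def build_cell_lists(points, cell_size):
    """Bin particles into cubic cells, chaining particles of one cell in a list."""
    head = {}                 # cell key (i, j, k) -> most recently inserted particle id
    nxt = [-1] * len(points)  # particle id -> next particle in the same cell (-1 ends the chain)
    for pid, p in enumerate(points):
        key = tuple(int(c // cell_size) for c in p)
        nxt[pid] = head.get(key, -1)
        head[key] = pid
    return head, nxt


def cell_members(head, nxt, key):
    """Walk the chain of particle ids stored in one cell."""
    pid = head.get(key, -1)
    while pid != -1:
        yield pid
        pid = nxt[pid]


def candidate_pairs(points, cell_size):
    """Return particle pairs no farther apart than cell_size, searching adjacent cells only."""
    head, nxt = build_cell_lists(points, cell_size)
    limit2 = cell_size * cell_size
    pairs = set()
    for (i, j, k) in head:
        for di, dj, dk in product((-1, 0, 1), repeat=3):
            neighbour = (i + di, j + dj, k + dk)
            for a in cell_members(head, nxt, (i, j, k)):
                for b in cell_members(head, nxt, neighbour):
                    if a < b:
                        d2 = sum((u - v) ** 2 for u, v in zip(points[a], points[b]))
                        if d2 <= limit2:
                            pairs.add((a, b))
    return sorted(pairs)
```

Because each cell only inspects its 27 neighbouring cells, the work per cell is independent of the others, which is what makes this kind of search straightforward to distribute across parallel threads.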
Abstract: In this paper, we present a parallel quasi-Chebyshev acceleration applied to the nonoverlapping multisplitting iterative method for linear systems whose coefficient matrix is either an H-matrix or a symmetric positive definite matrix. First, m parallel iterations are implemented on m different processors. Second, based on the l1-norm or l2-norm, the m optimization models are solved in parallel on m different processors. Convergence theories are established for the parallel quasi-Chebyshev accelerated method. Finally, numerical examples show that the parallel quasi-Chebyshev technique can significantly accelerate the nonoverlapping multisplitting iterative method.
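As a rough illustration of the underlying iteration only, the sketch below implements a plain nonoverlapping multisplitting sweep in its simplest block-Jacobi form, where each index block is solved independently (and could be assigned to its own processor). The quasi-Chebyshev acceleration and the l1/l2 optimization models from the abstract are not reproduced here; all function names and the block partition are assumptions.

```python
def solve_dense(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def multisplitting_step(a, b, x, blocks):
    """One sweep: every block solves its local system using the previous iterate elsewhere."""
    x_new = x[:]
    for idx in blocks:  # the blocks are independent, so this loop is parallelizable
        inside = set(idx)
        local_a = [[a[i][j] for j in idx] for i in idx]
        local_b = [
            b[i] - sum(a[i][j] * x[j] for j in range(len(x)) if j not in inside)
            for i in idx
        ]
        for i, value in zip(idx, solve_dense(local_a, local_b)):
            x_new[i] = value
    return x_new


def multisplitting_solve(a, b, blocks, tol=1e-10, max_iter=500):
    """Iterate sweeps until successive iterates differ by less than tol."""
    x = [0.0] * len(b)
    for it in range(1, max_iter + 1):
        x_next = multisplitting_step(a, b, x, blocks)
        if max(abs(p - q) for p, q in zip(x_next, x)) < tol:
            return x_next, it
        x = x_next
    return x, max_iter
```

For example, calling `multisplitting_solve(a, b, [[0, 1], [2, 3]])` on a 4x4 strictly diagonally dominant matrix splits the unknowns into two disjoint blocks; an acceleration scheme such as the one in the abstract would then combine successive iterates to reduce the number of sweeps.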