Graphics processing is an increasing important application domain with the demand of real-time rendering,video streaming,virtual reality,and so on.Illumination is a critical module in graphics rendering and is typical...Graphics processing is an increasing important application domain with the demand of real-time rendering,video streaming,virtual reality,and so on.Illumination is a critical module in graphics rendering and is typically compute-bound,memory-bound,and power-bound in different application cases.It is crucial to decide how to schedule different illumination algorithms with different features according to the practical requirements in reconfigurable graphics hardware.This paper analyze the performance characteristics of four main-stream lighting algorithms,Lambert illumination algorithm,Phong illumination algorithm,Blinn-Phong illumination algorithm,and Cook-Torrance illumination algorithm,using hardware performance counters on x86 processor platform KabyLake(KBL).The data movement,computation,power consumption,and memory accessing are evaluated over a range of application scenarios.Further,by analyzing the system-level behavior of these illumination algorithms,obtains the cons and pros of these specific algorithms were obtained.The associated relationship between performance/energy and the evaluated metrics was analyzed through Pearson correlation coefficient(PCC)analysis.According to these performance characterization data,this paper presents some reconfiguration suggestions in reconfigurable graphics processor.展开更多
Graphics processors have received an increasing attention with the growing demand for gaming,video streaming,and many other applications.During the graphics rendering with OpenGL,host CPU needs the runtime attributes ...Graphics processors have received an increasing attention with the growing demand for gaming,video streaming,and many other applications.During the graphics rendering with OpenGL,host CPU needs the runtime attributes to move on to the next procedure of rendering,which covers almost all the function units of graphics pipeline.Current methods suffer from the memory capacity issues to hold the variables or huge amount of data parsing paths which can cause congestion on the interface between graphics processor and host CPU.This paper refers to the operation principle of commuting bus,and proposes a bus-like data feedback mechanism(BFM)to traverse all the pipeline stages and collect the run-time status data or execution error of graphics rendering,then send them back to the host CPU.BFM can work in parallel with the graphics rendering logic.This method can complete the data feedback ta.sk easily with only 0.6%increase of resource utilization and has no negative impact on performance,which also obtains 1.3 times speed enhancement compared with a traditional approach.展开更多
Primitive assembly is an inevitable procedure of graphics rendering which performs the objects preparation for the following steps,however,the conventional approaches suffer from some issues,such as the missing of sur...Primitive assembly is an inevitable procedure of graphics rendering which performs the objects preparation for the following steps,however,the conventional approaches suffer from some issues,such as the missing of surface attribute,mismatch of color mode for clipped primitives,and performance bottleneck of rendering pipeline.This paper takes all these issues into considerations,and proposes a parallel primitive assembly accelerator(PPAA)which can solve not only the functional problems but also improve the shading performance.The register transfer level(RTL)circuit is designed and the detailed approach is presented.The prototype systems are implemented on Xilinx field programmable gate array(FPGA)XC6 VLX550 T and Altera FPGA EP2 C70 F896 C6.The experimental results show that PPAA can accomplish the assembly tasks correctly and with higher performance of 1.5x and 2.5x of two previous implementations.For the most frequently independent primitives,the PPAA can efficiently enhance the throughput by squeezing out the pipeline bubbles and by balancing the pipeline stages.展开更多
The availability of computers and communication networks allows us to gather and analyse data on a far larger scale than previously. At present, it is believed that statistics is a suitable method to analyse networks ...The availability of computers and communication networks allows us to gather and analyse data on a far larger scale than previously. At present, it is believed that statistics is a suitable method to analyse networks with millions, or more, of vertices. The MATLAB language, with its mass of statistical functions, is a good choice to rapidly realize an algorithm prototype of complex networks. The performance of the MATLAB codes can be further improved by using graphic processor units (GPU). This paper presents the strategies and performance of the GPU implementation of a complex networks package, and the Jacket toolbox of MATLAB is used. Compared with some commercially available CPU implementations, GPU can achieve a speedup of, on average, 11.3x. The experimental result proves that the GPU platform combined with the MATLAB language is a good combination for complex network research.展开更多
Based on the three-dimensional particle-in-cell (PIC) method and Compute Unified Device Architecture (CUDA), a parallel particle simulation code combined with a graphic processor unit (GPU) has been developed fo...Based on the three-dimensional particle-in-cell (PIC) method and Compute Unified Device Architecture (CUDA), a parallel particle simulation code combined with a graphic processor unit (GPU) has been developed for the simulation of charge-exchange (CEX) xenon ions in the plume of an ion thruster. Using the proposed technique, the potential and CEX plasma distribution are calculated for the ion thruster plume surrounding the DS1 spacecraft at different thrust levels. The simulation results are in good agreement with measured CEX ion parameters reported in literature, and the CPU's results are equal to a CPU's. Compared with a single CPU Intel Core 2 E6300, 16-processor GPU NVIDIA GeForce 9400 GT indicates a speedup factor of 3.6 when the total macro particle number is 1.1 × 10^6. The simulation results also reveal how the back flow CEX plasma affects the spacecraft floating potential, which indicates that the plume of the ion thruster is indeed able to alleviate the extreme negative floating potentials of spacecraft in geosynchronous orbit.展开更多
Low-Density Parity-Check (LDPC) codes are powerful error correcting codes. LDPC decoders have been implemented as efficient error correction codes on dedicated VLSI hardware architectures in recent years. This paper...Low-Density Parity-Check (LDPC) codes are powerful error correcting codes. LDPC decoders have been implemented as efficient error correction codes on dedicated VLSI hardware architectures in recent years. This paper describes two strategies to parallelize min-sum decoding of irregular LDPC codes. The first implements min-sum LDPC decoders on multicore platforms using OpenMP, while the other uses the Compute Unified Device Architecture (CUDA) to parallelize LDPC decoding on Graphics Processing Units (GPUs). Empirical studies on data with various scales show that the performance of these decoding processes is improved by these parallel strategies and the GPUs provide more efficient, fast implementation decoder.展开更多
With the unstructured grid, the Finite Volume Coastal Ocean Model(FVCOM) is converted from its original FORTRAN code to a Compute Unified Device Architecture(CUDA) C code, and optimized on the Graphic Processor U...With the unstructured grid, the Finite Volume Coastal Ocean Model(FVCOM) is converted from its original FORTRAN code to a Compute Unified Device Architecture(CUDA) C code, and optimized on the Graphic Processor Unit(GPU). The proposed GPU-FVCOM is tested against analytical solutions for two standard cases in a rectangular basin, a tide induced flow and a wind induced circulation. It is then applied to the Ningbo's coastal water area to simulate the tidal motion and analyze the flow field and the vertical tide velocity structure. The simulation results agree with the measured data quite well. The accelerated performance of the proposed 3-D model reaches 30 times of that of a single thread program, and the GPU-FVCOM implemented on a Tesla k20 device is faster than on a workstation with 20 CPU cores, which shows that the GPU-FVCOM is efficient for solving large scale sea area and high resolution engineering problems.展开更多
基金supported by the Natural National Science Foundation of China (61602377, 61834005, 61772417, 61802304,61874087,61634004)the International Science and Technology Cooperation Program of Shaanxi (2018KW006)
文摘Graphics processing is an increasing important application domain with the demand of real-time rendering,video streaming,virtual reality,and so on.Illumination is a critical module in graphics rendering and is typically compute-bound,memory-bound,and power-bound in different application cases.It is crucial to decide how to schedule different illumination algorithms with different features according to the practical requirements in reconfigurable graphics hardware.This paper analyze the performance characteristics of four main-stream lighting algorithms,Lambert illumination algorithm,Phong illumination algorithm,Blinn-Phong illumination algorithm,and Cook-Torrance illumination algorithm,using hardware performance counters on x86 processor platform KabyLake(KBL).The data movement,computation,power consumption,and memory accessing are evaluated over a range of application scenarios.Further,by analyzing the system-level behavior of these illumination algorithms,obtains the cons and pros of these specific algorithms were obtained.The associated relationship between performance/energy and the evaluated metrics was analyzed through Pearson correlation coefficient(PCC)analysis.According to these performance characterization data,this paper presents some reconfiguration suggestions in reconfigurable graphics processor.
基金the National Natural Science Foundation of China(Nos.61834005,61772417,61602377,61802304 and 61874087)the International Science and Technology Cooperation Program of Shaanxi China(No.2018KW-006)。
文摘Graphics processors have received an increasing attention with the growing demand for gaming,video streaming,and many other applications.During the graphics rendering with OpenGL,host CPU needs the runtime attributes to move on to the next procedure of rendering,which covers almost all the function units of graphics pipeline.Current methods suffer from the memory capacity issues to hold the variables or huge amount of data parsing paths which can cause congestion on the interface between graphics processor and host CPU.This paper refers to the operation principle of commuting bus,and proposes a bus-like data feedback mechanism(BFM)to traverse all the pipeline stages and collect the run-time status data or execution error of graphics rendering,then send them back to the host CPU.BFM can work in parallel with the graphics rendering logic.This method can complete the data feedback ta.sk easily with only 0.6%increase of resource utilization and has no negative impact on performance,which also obtains 1.3 times speed enhancement compared with a traditional approach.
基金supported by National Natural Science Foundation of China(61834005,61772417,61602377,61802304,61874087)Shaanxi International Science and Technology Cooperation Program(2018KW-006)+1 种基金Shaanxi Province Co-ordination Innovation Project of Science and Technology(2016KTZDGY02-04-02)Shaanxi Provincial Key R&D Plan(2017GY-060)。
文摘Primitive assembly is an inevitable procedure of graphics rendering which performs the objects preparation for the following steps,however,the conventional approaches suffer from some issues,such as the missing of surface attribute,mismatch of color mode for clipped primitives,and performance bottleneck of rendering pipeline.This paper takes all these issues into considerations,and proposes a parallel primitive assembly accelerator(PPAA)which can solve not only the functional problems but also improve the shading performance.The register transfer level(RTL)circuit is designed and the detailed approach is presented.The prototype systems are implemented on Xilinx field programmable gate array(FPGA)XC6 VLX550 T and Altera FPGA EP2 C70 F896 C6.The experimental results show that PPAA can accomplish the assembly tasks correctly and with higher performance of 1.5x and 2.5x of two previous implementations.For the most frequently independent primitives,the PPAA can efficiently enhance the throughput by squeezing out the pipeline bubbles and by balancing the pipeline stages.
基金Project supported by the Science Fund for Creative Research Groups of the National Natural Science Foundation of China (Grant No.60921062)the National Natural Science Foundation of China (Grant No.60873014)the Young Scientists Fund of the National Natural Science Foundation of China (Grant Nos.61003082 and 60903059)
文摘The availability of computers and communication networks allows us to gather and analyse data on a far larger scale than previously. At present, it is believed that statistics is a suitable method to analyse networks with millions, or more, of vertices. The MATLAB language, with its mass of statistical functions, is a good choice to rapidly realize an algorithm prototype of complex networks. The performance of the MATLAB codes can be further improved by using graphic processor units (GPU). This paper presents the strategies and performance of the GPU implementation of a complex networks package, and the Jacket toolbox of MATLAB is used. Compared with some commercially available CPU implementations, GPU can achieve a speedup of, on average, 11.3x. The experimental result proves that the GPU platform combined with the MATLAB language is a good combination for complex network research.
基金supported by National Natural Science Foundation of China (No. 10805004)Foundation of National Key Lab. of Science and Technology on Vacuum & Cryogenic of China (No. 9140C550404100C55)
文摘Based on the three-dimensional particle-in-cell (PIC) method and Compute Unified Device Architecture (CUDA), a parallel particle simulation code combined with a graphic processor unit (GPU) has been developed for the simulation of charge-exchange (CEX) xenon ions in the plume of an ion thruster. Using the proposed technique, the potential and CEX plasma distribution are calculated for the ion thruster plume surrounding the DS1 spacecraft at different thrust levels. The simulation results are in good agreement with measured CEX ion parameters reported in literature, and the CPU's results are equal to a CPU's. Compared with a single CPU Intel Core 2 E6300, 16-processor GPU NVIDIA GeForce 9400 GT indicates a speedup factor of 3.6 when the total macro particle number is 1.1 × 10^6. The simulation results also reveal how the back flow CEX plasma affects the spacecraft floating potential, which indicates that the plume of the ion thruster is indeed able to alleviate the extreme negative floating potentials of spacecraft in geosynchronous orbit.
基金Agilent Technology Foundation(No.912-CHN09)National Natural Science Foundation of China(No.61175110)+1 种基金National Key Basic Research and Development(973)Program of China(No.2012CB316305)National Key Projects of Science and Technology of China(No.2011ZX02101-004)
文摘Low-Density Parity-Check (LDPC) codes are powerful error correcting codes. LDPC decoders have been implemented as efficient error correction codes on dedicated VLSI hardware architectures in recent years. This paper describes two strategies to parallelize min-sum decoding of irregular LDPC codes. The first implements min-sum LDPC decoders on multicore platforms using OpenMP, while the other uses the Compute Unified Device Architecture (CUDA) to parallelize LDPC decoding on Graphics Processing Units (GPUs). Empirical studies on data with various scales show that the performance of these decoding processes is improved by these parallel strategies and the GPUs provide more efficient, fast implementation decoder.
基金Project supported by the National Natural Science Foundation of China(Grant No.51279028,51479175)the Public Science and Technology Research Funds Projects of Ocean(Grant No.201405025)
文摘With the unstructured grid, the Finite Volume Coastal Ocean Model(FVCOM) is converted from its original FORTRAN code to a Compute Unified Device Architecture(CUDA) C code, and optimized on the Graphic Processor Unit(GPU). The proposed GPU-FVCOM is tested against analytical solutions for two standard cases in a rectangular basin, a tide induced flow and a wind induced circulation. It is then applied to the Ningbo's coastal water area to simulate the tidal motion and analyze the flow field and the vertical tide velocity structure. The simulation results agree with the measured data quite well. The accelerated performance of the proposed 3-D model reaches 30 times of that of a single thread program, and the GPU-FVCOM implemented on a Tesla k20 device is faster than on a workstation with 20 CPU cores, which shows that the GPU-FVCOM is efficient for solving large scale sea area and high resolution engineering problems.