Based on the characteristics of intra prediction algorithms, the data dependence and parallelism between intra prediction models are first analyzed. This paper proposes a parallelization method based on dynamic reconfigurable array processors provided by the project team, and uses data-level parallel (DLP) algorithms in multi-core units. The experimental results show that the Y-component peak signal-to-noise ratio (Y-PSNR) is improved by about 10 dB and the coding time is reduced by 63% compared with the high-efficiency video coding (HEVC) test model HM10.0. This method can effectively reduce the codec time of the video and reduce computational complexity.
This paper proposes a real-time, low-complexity implementation of a low-density parity-check (LDPC) decoder for satellite communication on an FPGA platform. By adopting a (2048, 4096) irregular quasi-cyclic (QC) LDPC code, the proposed partly parallel decoding structure balances the complexity between the check node unit (CNU) and the variable node unit (VNU) based on the min-sum (MS) algorithm, thereby using fewer Slice resources and achieving better clock performance. Moreover, because a lookup table (LUT) is used to locate the node messages stored in the time-shared memory unit, the design is simple to reuse and saves a large amount of storage. The implementation results on a Xilinx FPGA chip show that, compared with the conventional structure, the proposed scheme achieves at least 28.6% and 8% cost reductions in RAM and Slice, respectively. The clock frequency is also increased to 280 MHz without degrading decoding performance or convergence speed.
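The check node update at the heart of the min-sum algorithm mentioned above can be sketched as follows. This is a minimal illustration of the textbook min-sum rule, not the FPGA architecture of the paper; the function name `min_sum_check_node` and the sample LLR values are assumptions made for the example.

```python
def min_sum_check_node(llrs):
    """Return the check-to-variable messages for one check node.

    llrs is the list of incoming variable-to-check LLRs. For each edge, the
    outgoing message takes the product of the signs of all OTHER incoming
    LLRs and the smallest magnitude among them (textbook min-sum rule).
    """
    if len(llrs) < 2:
        raise ValueError("a check node needs at least two incoming messages")
    out = []
    for i in range(len(llrs)):
        others = llrs[:i] + llrs[i + 1:]
        # Sign: flip once for every negative LLR among the other edges.
        sign = 1.0
        for v in others:
            if v < 0:
                sign = -sign
        # Magnitude: minimum absolute value among the other edges.
        out.append(sign * min(abs(v) for v in others))
    return out


# Hypothetical incoming messages for a degree-4 check node.
messages = min_sum_check_node([2.0, -3.0, 1.0, -4.0])
print(messages)  # [1.0, -1.0, 2.0, -1.0]
```

A hardware CNU would typically track only the two smallest magnitudes and the overall sign instead of recomputing the minimum per edge; the loop form is kept here for readability.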
After the extension of depth modeling mode 4 (DMM-4) in 3D high efficiency video coding (3D-HEVC), the computational complexity increases sharply, which degrades the real-time performance of video coding. To reduce the computational complexity of DMM-4, a simplified hardware-friendly contour prediction algorithm is proposed in this paper. Based on the similarity between texture and depth maps, the proposed algorithm directly codes depth blocks to calculate edge regions, which reduces the number of reference blocks. Verification with the test sequences on HTM16.1 shows that the coding time of the proposed algorithm is reduced by 9.42% compared with the original algorithm. To avoid the time cost of serial coding on HTM, a parallelization design of the proposed algorithm based on a reconfigurable array processor (DPR-CODEC) is proposed. The parallelization design reduces the storage access time and configuration time and saves storage cost. Verified on a Xilinx Virtex 6 FPGA, experimental results show that the parallelization design is capable of processing HD 1080p at more than 30 frames per second. Compared with related work, the scheme reduces the LUTs by 42.3%, the REG by 85.5% and the hardware resources by 66.7%. The data loading speedup ratio of the parallel scheme can reach 3.4539. On average, the serial/parallel speedup ratio of encoding time for templates of different sizes can reach 2.446.
At present, there are some static code analyses and optimizations that can be applied to Concurrent C programs to improve their performance or verify their logical correctness. These analyses and optimizations are inter-process. To make their implementation easy, we propose a new method to construct an optimizing compiling system, CCOC, for Concurrent C. CCOC supports inter-process code analysis and optimization of Concurrent C programs and does not affect the system's portability or the separate compilation of source programs. We also briefly discuss some implementation details of CCOC.
A novel low-complexity iterative receiver for multiuser space frequency block coding (SFBC) systems is proposed in this paper. Unlike the conventional linear minimum mean square error (MMSE) detector, which requires a matrix inversion at each iteration, the soft-in soft-out (SISO) detector is simply a parallel interference cancellation (PIC)-matched filter (MF) operation. The probability density function (PDF) of the PIC-MF detector output is approximated as Gaussian, with a variance calculated from the a priori information fed back from the channel decoder. With this approximation, the log-likelihood ratios (LLRs) of the transmitted bits are underestimated. The LLRs are therefore multiplied by a constant factor to achieve a performance gain. The constant factor is optimized according to the extrinsic information transfer (EXIT) chart of the SISO detector. Simulation results show that the proposed iterative receiver can significantly improve the system performance and converge to the matched filter bound (MFB) with low computational complexity at high signal-to-noise ratios (SNRs).
A high-speed, low-power Viterbi decoder architecture based on deep pipelining, clock gating and toggle filtering is presented in this paper. The Add-Compare-Select (ACS) and Trace Back (TB) units of the decoder and their sub-circuits operate in a deeply pipelined manner to achieve a high transmission rate. The power dissipation is also analyzed and compared with existing results. The techniques employed in the low-power design are clock gating and toggle filtering. The synthesized circuits are placed and routed in a standard cell design environment and implemented on a Xilinx XC2VP2fg256-6 FPGA device. Power estimation obtained through gate-level simulations indicates that the proposed design reduces the power dissipation of an original Viterbi decoder design by 68.82%, and a speed of 145 MHz is achieved.
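The Add-Compare-Select operation named above can be illustrated in software. This is a minimal sketch of one generic ACS step, not the pipelined hardware unit of the paper; the function name `acs_step`, the two-state toy trellis and the metric values are assumptions chosen for the example.

```python
def acs_step(path_metrics, predecessors, branch_metrics):
    """Perform one add-compare-select step over all trellis states.

    path_metrics[p]        : accumulated metric of state p (lower is better)
    predecessors[s]        : list of states with a transition into state s
    branch_metrics[(p, s)] : metric of transition p -> s for the current symbol
    Returns (new_metrics, survivors); survivors[s] is the predecessor chosen
    for state s, which a trace-back unit would use later.
    """
    new_metrics, survivors = [], []
    for s, preds in enumerate(predecessors):
        # Add: extend each incoming path. Compare/select: keep the smallest.
        best = min(preds, key=lambda p: path_metrics[p] + branch_metrics[(p, s)])
        new_metrics.append(path_metrics[best] + branch_metrics[(best, s)])
        survivors.append(best)
    return new_metrics, survivors


# Hypothetical fully connected two-state trellis.
metrics, survivors = acs_step(
    path_metrics=[0, 3],
    predecessors=[[0, 1], [0, 1]],
    branch_metrics={(0, 0): 2, (1, 0): 0, (0, 1): 1, (1, 1): 4},
)
print(metrics, survivors)  # [2, 1] [0, 0]
```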
The time delay of Turbo codes caused by iterative decoding is the main bottleneck for their application in real-time channels. However, the time delay can be greatly shortened by adopting a parallel decoding algorithm, which divides the received bits into several sub-blocks and processes them in parallel. This letter discusses the applicability of Turbo codes in high-speed real-time channels through the study of a parallel Turbo decoding algorithm based on the 3GPP-proposed Turbo encoder and interleaver in various channels. Simulation results show that, by choosing an appropriate sub-block length, the time delay can be noticeably shortened without degrading the performance or increasing hardware complexity, which further indicates the applicability of Turbo codes in high-speed real-time channels.
Wireless communication systems have advanced greatly in recent years. A significant contributor to the performance of these systems has been Orthogonal Frequency Division Multiplexing (OFDM), which has been considered a technological leap since its invention. This approach of splitting an information stream across multiple frequency carriers has been adopted by many scientists working on the development of wireless systems. Moreover, because OFDM shows excellent tolerance of channel fading and noise, gains in speed and reliability followed, since only a small part of the information stream is lost to noise effects. OFDM, together with the knowledge that Turbo codes are another excellent scheme for reducing BER, prompted us to expand our research. We therefore experimented, at simulation level, not only with combining OFDM and Turbo codes but also with finding a better Turbo scheme than a typical PCCC, an SCCC and a convolutional encoder with a Viterbi decoder. As the latter goal has already been accomplished, this paper presents the new OFDM system built on our Turbo scheme. The analysis of this system took into consideration the effects of an AWGN channel. The noise analysis was conducted on a simulation platform with specific attributes, such as transmitting and receiving a fixed number of subcarriers (2048 carriers after the IFFT block), while using different types of concatenated convolutional codes: PCCC (parallel), SCCC (serial) and the new PCCC scheme. The results clearly show not only the improvement in the BER performance of the Turbo-coded OFDM systems (compared with systems using Viterbi decoders) but also the overall superiority of the proposed design.
Funding: Supported by the Fundamental Research Funds for the Central Universities (FRF-TP20-062A1) and the Guangdong Basic and Applied Basic Research Foundation (2021A1515110070).
Abstract: This paper presents a software turbo decoder on graphics processing units (GPUs). Unlike previous works, the proposed decoding architecture for turbo codes mainly focuses on the Consultative Committee for Space Data Systems (CCSDS) standard. However, the information frame lengths of the CCSDS turbo codes are not suitable for flexible sub-frame parallelism design. To mitigate this issue, we propose a padding method that inserts several bits before the information frame header. To obtain low latency and high resource utilization, two levels of intra-frame parallelism and an efficient data structure are considered. The presented Max-Log-MAP decoder can be adapted to decode the Long Term Evolution (LTE) turbo codes with only small modifications. The proposed CCSDS turbo decoder at 10 iterations on an NVIDIA RTX 3070 achieves throughputs of about 150 Mbps and 50 Mbps for the code rates 1/6 and 1/2, respectively.
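The padding idea described in this abstract can be sketched as follows. The abstract does not state how many bits are inserted or what value they take, so this is only an illustration under assumptions: the pad bits are zeros, enough are prepended to make the frame length a multiple of the number of sub-frames, and the names `pad_frame` and `split_subframes` as well as the choice of 32 sub-frames are hypothetical.

```python
def pad_frame(bits, num_subframes, pad_bit=0):
    """Prepend pad bits so the frame length is a multiple of num_subframes.

    Returns (padded_bits, n_pad). Zero-valued padding and the "multiple of
    the sub-frame count" rule are assumptions, not details from the abstract.
    """
    if num_subframes <= 0:
        raise ValueError("num_subframes must be positive")
    n_pad = (-len(bits)) % num_subframes
    return [pad_bit] * n_pad + list(bits), n_pad


def split_subframes(bits, num_subframes):
    """Split an already padded frame into equally sized sub-frames."""
    if len(bits) % num_subframes != 0:
        raise ValueError("frame length must be a multiple of num_subframes")
    size = len(bits) // num_subframes
    return [bits[i * size:(i + 1) * size] for i in range(num_subframes)]


# A 1784-bit frame (one of the CCSDS information block lengths) does not
# divide evenly into 32 sub-frames, so 8 bits are prepended.
frame = [1] * 1784
padded, n_pad = pad_frame(frame, num_subframes=32)
subframes = split_subframes(padded, num_subframes=32)
print(n_pad, len(padded), len(subframes[0]))  # 8 1792 56
```

After decoding, the first `n_pad` decisions would simply be discarded to recover the original frame.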
Abstract: A multi-dimensional concatenation scheme for block codes is introduced, in which information symbols are interleaved and re-encoded more than once. It provides a convenient platform for designing high-performance codes with flexible interleaver size. Coset-based MAP soft-in/soft-out decoding algorithms are presented for the F24 code. Simulation results show that the proposed coding scheme can achieve high coding gain with flexible interleaver length and very low decoding complexity.
Funding: Project supported by the National Basic Research Program (973) of China (No. 2002CB312105), the National Natural Science Foundation of China (No. 60573074), the Natural Science Foundation of Shanxi Province, China (No. 20041040), the Shanxi Foundation of Tackling Key Problem in Science and Technology (No. 051129), and the Key NSFC Project of "Digital Olympic Museum" (No. 60533080), China.
Abstract: The use of compressed meshes in parallel rendering architectures is still an unexplored area, whose main challenge is to partition and sort the encoded mesh in the compression domain. This paper presents a mesh compression scheme, PRMC (Parallel Rendering based Mesh Compression), which supplies encoded meshes that can be partitioned and sorted in a parallel rendering system even in the encoded domain. First, we segment the mesh into submeshes and clip the submeshes' boundaries into Runs, and then compress the submeshes and Runs piecewise. With the help of several auxiliary index tables, compressed submeshes and Runs can serve as rendering primitives in a parallel rendering system. Based on PRMC, we design and implement a parallel rendering architecture. Experimental results show that, compared with the uncompressed representation, PRMC meshes applied in a cluster parallel rendering system can dramatically reduce the communication requirement.
Abstract: Based on the BCJR algorithm proposed by Bahl et al. and linear soft decision feedback, a reduced-complexity parallel interference cancellation (simplified PIC) scheme for convolutionally coded DS-CDMA systems is proposed. By computer simulation, we compare the simplified PIC with the exact PIC. The results show that the simplified PIC can achieve performance close to that of the exact PIC if the mean values of the coded symbols are linearly computed from the sum of the initial a priori log-likelihood ratio (LLR) and the updated a priori LLR, while a significant performance loss occurs if the mean values of the coded symbols are linearly computed from the updated a priori LLR only. We also compare the simplified PIC with the matched-filter (MF) receiver and conventional PICs. The simulation results show that the simplified PIC clearly outperforms the MF receiver and conventional PICs. At a signal-to-noise ratio (SNR) of 7 dB, for example, the bit error rate of the simplified PIC is about 10^-4, which is far below that of the MF receiver and conventional PIC. Key words: convolutionally coded CDMA; parallel interference cancellation; BCJR. CLC number: TN 914. Foundation item: Supported by the National Natural Science Foundation of China (69772015). Biography: Xu Guo-xiong (1967-), male, Ph.D. candidate, research direction: wireless communication.
Abstract: A code developed recently by the authors for counting and computing the eigenvalues of a complex tridiagonal matrix, as well as the roots of a complex polynomial, that lie in a given region of the complex plane is modified to run in parallel on multi-core machines. A basic characteristic of this code (which points to its parallelization) is that it can proceed by: 1) partitioning the given region into an appropriate number of subregions; 2) counting the eigenvalues in each subregion; and 3) computing the (already counted) eigenvalues in each subregion. Consequently, in theory, the whole code parallelizes ideally. We carry out several numerical experiments with random complex tridiagonal matrices and random complex polynomials in order to study the behaviour of the parallel code, especially the degree of deviation from theoretical expectations.
Funding: Supported by the National High-Technology Research and Development Program of China (Grant No. 2003AA123310) and the National Natural Science Foundation of China (Grant Nos. 60332030, 60572157).
Abstract: In this paper we discuss a novel storage scheme for simultaneous memory access in a parallel turbo decoder. The new scheme employs vertex coloring from graph theory. Compared to a similar method that also uses unnatural order in storage, our scheme requires 25 more memory blocks but allows a simpler configuration for variable code lengths that can be implemented on-chip. Experiments show that for a moderate to high decoding throughput (40-100 Mbps), the hardware cost is still affordable for the interleaver of 3GPP (3rd Generation Partnership Project).
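How vertex coloring can yield a conflict-free memory mapping may be easier to see with a small sketch. The abstract does not describe the paper's actual coloring procedure, so the example below substitutes a plain greedy coloring, which is always conflict-free but may use more banks than an optimal coloring. The function names, the frame size of 40, the 4 parallel processors and the toy interleaver `(7*i + 3) % 40` are all assumptions made for illustration.

```python
def assign_banks(n, p, interleaver):
    """Map each of n addresses to a memory bank by greedy vertex coloring.

    p processors each handle a window of n // p positions. At step t they
    access addresses {t + k*window} in natural order and the interleaved
    images of those addresses in interleaved order. Addresses accessed in
    the same step are joined by an edge and must get different banks.
    """
    if n % p != 0:
        raise ValueError("n must be a multiple of p")
    window = n // p
    neighbours = {a: set() for a in range(n)}
    for t in range(window):
        natural = [t + k * window for k in range(p)]
        permuted = [interleaver[a] for a in natural]
        for group in (natural, permuted):
            for a in group:
                neighbours[a].update(b for b in group if b != a)
    bank = {}
    for a in range(n):
        # Greedy rule: take the lowest bank index not used by a neighbour.
        used = {bank[b] for b in neighbours[a] if b in bank}
        colour = 0
        while colour in used:
            colour += 1
        bank[a] = colour
    return bank


def is_conflict_free(bank, n, p, interleaver):
    """Check that no two simultaneous accesses hit the same bank."""
    window = n // p
    for t in range(window):
        natural = [t + k * window for k in range(p)]
        permuted = [interleaver[a] for a in natural]
        for group in (natural, permuted):
            if len({bank[a] for a in group}) != len(group):
                return False
    return True


# Toy interleaver (a permutation because gcd(7, 40) = 1).
N, P = 40, 4
pi = [(7 * i + 3) % N for i in range(N)]
banks = assign_banks(N, P, pi)
print(is_conflict_free(banks, N, P, pi), max(banks.values()) + 1)
```

The second printed value is the number of banks the greedy pass ended up using; it is at least `P` and depends on the interleaver.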
Abstract: Genetic algorithms perform very well on large optimization problems, especially in the domain of error-correcting codes. However, they have a major drawback in time complexity and memory occupation when running on a uniprocessor computer. This paper proposes a parallel decoder for linear block codes using parallel genetic algorithms (PGA). The good performance and time complexity are confirmed by theoretical study and by simulations on BCH(63,30,14) codes over both AWGN and flat Rayleigh fading channels. The simulation results show that the coding gain of the parallel over the single genetic algorithm is about 0.7 dB at BER = 10^-5 with only 4 processors.
Funding: Sponsored by the National Natural Science Foundation of China (Grant No. 61032003) and the Fundamental Research Funds for the Central Universities (Grant No. HIT.NSRIF.2012021).
Abstract: In this paper, for the AR4JA codes used in deep space communication, two kinds of iterative decoding, partly parallel decoding and overlapped partly parallel decoding, are analyzed, and their advantages and disadvantages are listed. A modified overlapped partly parallel decoding is proposed that inherits the advantages of the two algorithms while overcoming their shortcomings. The simulation results show that the three kinds of decoding have the same decoding performance, and that the modified overlapped partly parallel decoding improves the iterative convergence rate and the throughput of the system.
Funding: Project (No. 2005AA1Z1271) supported by the Hi-Tech Research and Development Program (863) of China.
Abstract: To efficiently exploit the performance of single instruction multiple data (SIMD) architectures for video coding, a parallel memory architecture with a power-of-two number of memory modules is proposed. It employs two novel skewing schemes to provide conflict-free access to adjacent elements (8-bit and 16-bit data types) or to elements at power-of-two intervals in both horizontal and vertical directions, which was not possible in previous parallel memory architectures. Area and delay estimates are given for 4, 8 and 16 memory modules. In a 0.18-μm CMOS technology, the synthesis results show that the proposed system can achieve a 230 MHz clock frequency with 16 memory modules at a cost of 19k gates when the read and write latencies are 3 and 2 clock cycles, respectively. We implement the proposed parallel memory architecture on a video signal processor (VSP). The results show that the VSP enhanced with the proposed architecture achieves a 1.28× speedup for H.264 real-time decoding.
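For readers unfamiliar with skewing, the general idea can be shown with the classic linear skew, in which element (row, col) is stored in module (row + col) mod N. This is the textbook scheme, not either of the two novel schemes of the paper, which the abstract does not specify. The function names and the 8×8 test array are assumptions.

```python
def module_of(row, col, num_modules):
    """Classic linear skew: storage module holding element (row, col)."""
    return (row + col) % num_modules


def access_is_conflict_free(coords, num_modules):
    """True if all accessed elements fall in distinct memory modules."""
    modules = [module_of(r, c, num_modules) for r, c in coords]
    return len(set(modules)) == len(modules)


N = 4  # number of memory modules (a power of two, as in the abstract)

# N adjacent elements along a row, and N adjacent elements along a column,
# starting from every position of a hypothetical 8x8 array.
rows_ok = all(
    access_is_conflict_free([(r, c + k) for k in range(N)], N)
    for r in range(8) for c in range(8)
)
cols_ok = all(
    access_is_conflict_free([(r + k, c) for k in range(N)], N)
    for r in range(8) for c in range(8)
)
print(rows_ok, cols_ok)  # True True

# The plain linear skew breaks down for a stride of 2 along a row:
# columns 0, 2, 4, 6 map to modules 0, 2, 0, 2.
print(access_is_conflict_free([(0, 2 * k) for k in range(N)], N))  # False
```

The failing strided case is the kind of power-of-two-interval access that the abstract says its schemes are designed to support.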
Funding: The University of Vigo is acknowledged for financing part of the first author's PhD studies, and the Spanish Ministry of Economy and Competitiveness for funding the project 'Deepening on the behaviour of rock masses: Scale effects on the stress-strain response of fissured rock samples with particular emphasis on post-failure', awarded under Contract Reference No. RTI2018-093563-B-I00 and partially financed by European Regional Development Funds from the European Union (EU).
Abstract: This study presents a calibration process for three-dimensional particle flow code (PFC3D) simulation of intact and fissured granite samples. First, the laboratory stress-strain response from triaxial testing of intact and fissured granite samples is recalled. Then, PFC3D is introduced, with a focus on the bonded particle models (BPM). After that, we present previous studies in which intact rock is simulated by means of flat-joint approaches, and how improved accuracy was gained with the help of parametric studies. Then, models of the pre-fissured rock specimens were generated, including modeled fissures in the form of "smooth joint" type contacts. Finally, triaxial testing simulations of 1 + 2 and 2 + 3 jointed rock specimens were performed. Results show that both the elastic behavior and the peak strength levels are closely matched, without any additional fine tuning of micro-mechanical parameters. Concerning the post-failure behavior, the models reproduce the trends of decreasing dilation with increasing confinement and plasticity. However, the simulated dilation values are larger than those observed in practice. This is attributed to the difficulty of modeling some phenomena of fissured rock behavior, such as crushing of rock piece corners with dust production and interactions between newly formed shear bands or axial splitting cracks and pre-existing joints.
Funding: Supported by the Basic Science Research Program through the National Research Foundation (2015R1D1A3A01019869), Korea.
Abstract: In the era of modern high performance computing, GPUs have been considered an excellent accelerator for general purpose data-intensive parallel applications. To achieve application speedup on GPUs, many performance-oriented optimization techniques have been proposed. However, given the recent emphasis on power and energy consumption, power/energy-aware optimization of GPUs needs to be investigated with detailed analysis in addition to performance-oriented optimization. In this work, to explore the impact of various optimization strategies on GPU performance, power and energy consumption, we evaluate the performance and power/energy consumption of a well-known application running on different commercial GPU devices with different optimization strategies. In particular, to observe more general performance and power consumption patterns of GPU-based acceleration, our evaluations are performed with three different NVIDIA GPU generations (Fermi, Kepler and Maxwell architectures) and various core clock and memory clock frequencies. We analyze how a GPU kernel execution is affected by optimization and which GPU architectural factors have the most impact on its performance and power/energy consumption. This paper also categorizes which optimization technique primarily improves which metric (i.e., performance, power or energy efficiency). Furthermore, voltage frequency scaling (VFS) is applied to examine the effect of changing the clock frequency on these metrics. In general, our work shows that effective GPU optimization strategies can improve application performance significantly without increasing power and energy consumption.
Funding: Supported by the National Natural Science Foundation of China (No. 61772417, 61634004, 61602377, 61272120), the Shaanxi Provincial Co-ordination Innovation Project of Science and Technology (No. 2016KTZDGY02-04-02), and the Shaanxi Provincial Key R&D Plan (No. 2017GY-060).
Abstract: Based on the characteristics of intra prediction algorithms, the data dependence and parallelism between intra prediction models are first analyzed. This paper proposes a parallelization method based on dynamic reconfigurable array processors provided by the project team, and uses data level parallel (DLP) algorithms in multi-core units. The experimental results show that the Y-component peak signal to noise ratio (Y-PSNR) is improved by about 10 dB and the time is reduced by 63% compared with the high-efficiency video coding (HEVC) test model HM10.0. This method can effectively reduce the codec time of the video and reduce computational complexity.
Abstract: This paper proposes a low-complexity real-time implementation of a low-density parity-check (LDPC) decoder for satellite communication on an FPGA platform. By adopting a (2048, 4096) irregular quasi-cyclic (QC) LDPC code, the proposed partly parallel decoding structure balances the complexity between the check node unit (CNU) and the variable node unit (VNU) based on the min-sum (MS) algorithm, thereby using fewer Slice resources and achieving superior clock performance. Moreover, because a lookup table (LUT) is used to locate the node messages stored in the time-share memory unit, the structure is simple to reuse and saves a large amount of storage resources. The implementation results on a Xilinx FPGA chip illustrate that, compared with the conventional structure, the proposed scheme can achieve at least 28.6% and 8% cost reduction in RAM and Slice, respectively. The clock frequency is also increased to 280 MHz without deterioration of decoding performance or reduction of convergence speed.
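The min-sum rule that a check node unit evaluates can be sketched as follows. This is the textbook min-sum check node update, not the paper's hardware structure; the function name and the sample LLR values are assumptions for illustration.

```python
def min_sum_check_node(incoming):
    """Return the outgoing message on each edge of one check node.

    incoming: list of variable-to-check LLRs, one per edge (at least two).
    For edge i, the output sign is the product of the signs of all other
    inputs, and the output magnitude is the smallest magnitude among them.
    """
    if len(incoming) < 2:
        raise ValueError("a check node needs at least two incoming messages")
    outgoing = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]  # exclude the edge's own input
        sign = 1.0
        for llr in others:
            if llr < 0:
                sign = -sign
        outgoing.append(sign * min(abs(llr) for llr in others))
    return outgoing


messages = min_sum_check_node([2.5, -1.0, 4.0, -0.5])
print(messages)  # [0.5, -0.5, 0.5, -1.0]
```

Because the update needs only comparisons and sign flips, it maps to far simpler logic than the exact sum-product rule, which is one reason min-sum is a common choice for hardware decoders.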
Funding: Supported by the National Natural Science Foundation of China (No. 61834005, 61772417, 61802304, 61602377, 61874087, 61634004) and the Shaanxi Province Key R&D Plan (No. 2020JM-525, 2021GY-029, 2021KW-16).
Abstract: After the extension of depth modeling mode 4 (DMM-4) in 3D high efficiency video coding (3D-HEVC), the computational complexity increases sharply, which degrades the real-time performance of video coding. To reduce the computational complexity of DMM-4, a simplified hardware-friendly contour prediction algorithm is proposed in this paper. Based on the similarity between the texture and the depth map, the proposed algorithm directly codes depth blocks to calculate edge regions, reducing the number of reference blocks. Verification with the test sequences on HTM16.1 shows that the coding time of the proposed algorithm is reduced by 9.42% compared with the original algorithm. To avoid the time cost of serial coding on HTM, a parallelization design of the proposed algorithm based on a reconfigurable array processor (DPR-CODEC) is proposed. The parallelization design reduces the storage access time and configuration time and saves storage cost. Verified on a Xilinx Virtex 6 FPGA, the experimental results show that the parallelization design is capable of processing HD 1080p at a speed above 30 frames per second. Compared with related work, the scheme reduces the LUTs by 42.3%, the REG by 85.5% and the hardware resources by 66.7%. The data loading speedup ratio of the parallel scheme can reach 3.4539. On average, the serial-to-parallel speedup ratio of encoding time across templates of different sizes can reach 2.446.
Abstract: At present, there are some static code analyses and optimizations that can be applied to Concurrent C programs to improve their performance or verify their logical correctness. These analyses and optimizations are inter-process. To make their implementation easy, we propose a new method to construct an optimizing compiling system, CCOC, for Concurrent C. CCOC supports inter-process code analysis and optimization of Concurrent C programs without affecting the system's portability or the separate compilation of source programs. We also briefly discuss some implementation details of CCOC.
Funding: The Science and Technology Committee of Shanghai Municipality (No. 06DZ15013, No. 03DZ15010)
Abstract: A novel low-complexity iterative receiver for multiuser space frequency block coding (SFBC) systems is proposed in this paper. Unlike the conventional linear minimum mean square error (MMSE) detector, which requires matrix inversion at each iteration, the soft-in soft-out (SISO) detector is simply a parallel interference cancellation (PIC)-matched filter (MF) operation. The probability density function (PDF) of the PIC-MF detector output is approximated as Gaussian, with a variance calculated from the a priori information fed back from the channel decoder. With this approximation, the log likelihood ratios (LLRs) of the transmitted bits are under-estimated. The LLRs are therefore multiplied by a constant factor to achieve a performance gain. The constant factor is optimized according to the extrinsic information transfer (EXIT) chart of the SISO detector. Simulation results show that the proposed iterative receiver can significantly improve the system performance and converge to the matched filter bound (MFB) with low computational complexity at high signal-to-noise ratios (SNRs).
Abstract: A high speed and low power Viterbi decoder architecture design based on deep pipelining, clock gating and toggle filtering is presented in this paper. The Add-Compare-Select (ACS) and Trace Back (TB) units of the decoder and their sub-circuits are operated in a deeply pipelined manner to achieve a high transmission rate. The power dissipation is also analyzed and compared with existing results. The techniques employed in our low-power design are clock gating and toggle filtering. The synthesized circuits are placed and routed in the standard cell design environment and implemented on a Xilinx XC2VP2fg256-6 FPGA device. Power estimation obtained through gate level simulations indicates that the proposed design reduces the power dissipation of an original Viterbi decoder design by 68.82%, and a speed of 145 MHz is achieved.
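As a brief illustration of what an Add-Compare-Select unit computes, here is a minimal software sketch of one ACS step over a trellis with two predecessors per state. It is the generic textbook operation, not the pipelined hardware described in the abstract; the function name, the data layout and the two-state toy trellis are assumptions.

```python
def add_compare_select(path_metrics, predecessors, branch_metrics):
    """Perform one ACS step for every trellis state.

    path_metrics[s]    : accumulated metric of state s (lower is better)
    predecessors[s]    : pair of states that can transition into s
    branch_metrics[s]  : pair of branch metrics for those two transitions
    Returns (new_path_metrics, decisions); decisions[s] is 0 or 1 and
    records which predecessor survived, for use during trace back.
    """
    new_metrics, decisions = [], []
    for s in range(len(path_metrics)):
        # Add: extend both candidate paths into state s
        cand0 = path_metrics[predecessors[s][0]] + branch_metrics[s][0]
        cand1 = path_metrics[predecessors[s][1]] + branch_metrics[s][1]
        # Compare and select: keep the smaller metric (ties keep candidate 0)
        if cand0 <= cand1:
            new_metrics.append(cand0)
            decisions.append(0)
        else:
            new_metrics.append(cand1)
            decisions.append(1)
    return new_metrics, decisions


# Toy two-state trellis in which each state is reachable from states 0 and 1
metrics, survivors = add_compare_select(
    path_metrics=[2, 0],
    predecessors=[(0, 1), (0, 1)],
    branch_metrics=[(1, 0), (0, 2)],
)
print(metrics, survivors)  # [0, 2] [1, 0]
```

The per-state computations are independent of one another within a step, which is what makes the ACS stage a natural target for parallel and pipelined hardware.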
Abstract: The time delay of Turbo codes due to iterative decoding is the main bottleneck for their application in real-time channels. However, the time delay can be greatly shortened by adopting a parallel decoding algorithm, which divides the received bits into several sub-blocks and processes them in parallel. This letter mainly discusses the applicability of Turbo codes in high-speed real-time channels through the study of a parallel Turbo decoding algorithm based on the 3GPP-proposed Turbo encoder and interleaver in various channels. Simulation results show that, by choosing an appropriate sub-block length, the time delay can be markedly shortened without degrading the performance or increasing hardware complexity, which further indicates the applicability of Turbo codes in high-speed real-time channels.
Abstract: Wireless communication systems have advanced greatly in recent years. A significant contributor to the performance of these systems has been Orthogonal Frequency Division Multiplexing (OFDM). Since its invention, it has been considered a technological leap. This approach of splitting an information stream across multiple frequency carriers has been adopted by various scientists working on the development of wireless systems. Moreover, because OFDM shows excellent tolerance of channel fading and noise, gains in speed and reliability followed, since only a small part of the information stream is lost to noise effects. OFDM, together with the knowledge that Turbo codes are another excellent scheme for reducing BER, prompted us to expand our research. We therefore experimented at the simulation level not only with joining OFDM and Turbo codes but also with finding a better Turbo scheme than a typical PCCC, an SCCC, and a convolutional encoder with a Viterbi decoder. As the latter goal has already been accomplished, this paper presents the new OFDM system built on our Turbo scheme. The analysis of the aforementioned system took into consideration the effects of an AWGN channel. This noise analysis was conducted on a simulation platform with specific attributes, such as transmitting and receiving a fixed number of subcarriers (2048 carriers after the IFFT block), while using different types of concatenated convolutional codes, such as PCCC (parallel), SCCC (serial) and the new PCCC scheme. The results clearly show not only the improvement in the BER performance of the Turbo coded OFDM systems (compared with systems using Viterbi decoders) but also the overall superiority of the proposed design.