Abstract: A multidimensional concatenation scheme for block codes is introduced, in which information symbols are interleaved and re-encoded more than once. It provides a convenient platform for designing high-performance codes with flexible interleaver size. Coset-based MAP soft-in/soft-out decoding algorithms are presented for the F24 code. Simulation results show that the proposed coding scheme can achieve high coding gain with flexible interleaver length and very low decoding complexity.
Abstract: A code recently developed by the authors for counting and computing the eigenvalues of a complex tridiagonal matrix, as well as the roots of a complex polynomial, that lie in a given region of the complex plane is modified to run in parallel on multi-core machines. A basic characteristic of this code, which makes it a natural candidate for parallelization, is that it proceeds by: 1) partitioning the given region into an appropriate number of subregions; 2) counting the eigenvalues in each subregion; and 3) computing the (already counted) eigenvalues in each subregion. Consequently, in theory, the whole code parallelizes ideally. We carry out several numerical experiments with random complex tridiagonal matrices and random complex polynomials in order to study the behaviour of the parallel code, especially the degree of deviation from theoretical expectations.
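The partition-then-count steps described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the counting rule used here (a numerical winding number based on the argument principle) is an assumption, as are the function names `split`, `count_roots` and `count_all`, and the third step (actually computing the roots) is omitted. It also assumes no root lies on a subregion boundary.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def poly_eval(coeffs, z):
    """Evaluate a polynomial (highest-degree coefficient first) by Horner's rule."""
    acc = 0j
    for c in coeffs:
        acc = acc * z + c
    return acc


def split(rect, nx, ny):
    """Step 1: partition rect = (x0, x1, y0, y1) into nx * ny sub-rectangles."""
    x0, x1, y0, y1 = rect
    dx, dy = (x1 - x0) / nx, (y1 - y0) / ny
    return [(x0 + i * dx, x0 + (i + 1) * dx, y0 + j * dy, y0 + (j + 1) * dy)
            for j in range(ny) for i in range(nx)]


def count_roots(coeffs, rect, samples_per_edge=400):
    """Step 2: count the roots inside rect via the winding number of p(z)
    along the boundary, traversed counterclockwise (assumed counting rule)."""
    x0, x1, y0, y1 = rect
    corners = [complex(x0, y0), complex(x1, y0), complex(x1, y1), complex(x0, y1)]
    total = 0.0
    for i in range(4):
        a, b = corners[i], corners[(i + 1) % 4]
        prev = poly_eval(coeffs, a)
        for k in range(1, samples_per_edge + 1):
            cur = poly_eval(coeffs, a + (b - a) * k / samples_per_edge)
            total += cmath.phase(cur / prev)  # small phase increment per step
            prev = cur
    return round(total / (2 * cmath.pi))


def count_all(coeffs, region, nx, ny, workers=4):
    """Count roots in every subregion; the subregions are independent tasks."""
    rects = split(region, nx, ny)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: count_roots(coeffs, r), rects))


# p(z) = (z - 1)(z + 1)(z - 2i) = z^3 - 2i z^2 - z + 2i
counts = count_all([1, -2j, -1, 2j], (-2.5, 3.5, -2.5, 3.5), 2, 2)
```

Because each subregion is processed without reference to the others, the work distributes naturally across workers. In CPython, a process pool rather than a thread pool would be needed for real multi-core speedup; threads are used here only to keep the sketch short.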
Abstract: Based on the BCJR algorithm proposed by Bahl et al. and linear soft-decision feedback, a reduced-complexity parallel interference cancellation scheme (simplified PIC) for convolutionally coded DS-CDMA systems is proposed. Using computer simulation, we compare the simplified PIC with the exact PIC. The results show that the simplified PIC performs close to the exact PIC if the mean values of the coded symbols are computed linearly from the sum of the initial a priori log-likelihood ratio (LLR) and the updated a priori LLR, whereas a significant performance loss occurs if the mean values are computed linearly from the updated a priori LLR only. We also compare the simplified PIC with the matched-filter (MF) receiver and conventional PICs. The simulation results show that the simplified PIC clearly outperforms both: at a signal-to-noise ratio (SNR) of 7 dB, for example, the bit error rate of the simplified PIC is about 10^-4, far below that of the MF receiver and the conventional PIC.
Key words: convolutionally coded CDMA; parallel interference cancellation; BCJR. CLC number: TN 914.
Foundation item: Supported by the National Natural Science Foundation of China (69772015).
Biography: Xu Guo-xiong (1967-), male, Ph.D. candidate; research direction: wireless communication.
Funding: Supported by the National High-Technology Research and Development Program of China (Grant No. 2003AA123310) and the National Natural Science Foundation of China (Grant Nos. 60332030, 60572157).
Abstract: In this paper we discuss a novel storage scheme for simultaneous memory access in a parallel turbo decoder. The new scheme employs vertex coloring from graph theory. Compared with a similar method that also stores data in an unnatural order, our scheme requires 25 more memory blocks but allows a simpler configuration for variable code lengths that can be implemented on-chip. Experiments show that for moderate to high decoding throughput (40-100 Mbps), the hardware cost is still affordable for the 3GPP (3rd Generation Partnership Project) interleaver.
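To make the vertex-coloring idea concrete, here is a minimal sketch under stated assumptions: `P` decoder units each work on a window of `W = N / P` symbols, so at step `t` they access addresses `t + k*W` in natural order and `perm[t + k*W]` in interleaved order. Addresses accessed in the same step conflict and must be stored in different memory blocks. The greedy coloring below is a generic stand-in, not the paper's algorithm, and the names `conflict_graph` and `greedy_coloring` are hypothetical.

```python
def conflict_graph(perm, num_units):
    """Build the graph whose edges join addresses accessed in the same step.

    perm is the interleaver as a permutation of range(N); N is assumed to be
    divisible by num_units.
    """
    n = len(perm)
    window = n // num_units
    adj = {a: set() for a in range(n)}
    for t in range(window):
        natural = [t + k * window for k in range(num_units)]
        interleaved = [perm[t + k * window] for k in range(num_units)]
        for group in (natural, interleaved):
            for a in group:
                for b in group:
                    if a != b:
                        adj[a].add(b)
    return adj


def greedy_coloring(adj):
    """Assign each address the lowest-numbered memory block (color) not
    already used by one of its neighbours."""
    color = {}
    for v in sorted(adj):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


# Toy interleaver of length 8 shared by 2 decoder units.
perm = [3, 6, 1, 4, 7, 2, 5, 0]
adj = conflict_graph(perm, 2)
banks = greedy_coloring(adj)  # banks[address] -> memory block index
```

A greedy coloring is not guaranteed to use the minimum number of memory blocks, which fits the trade-off the abstract describes: some extra blocks in exchange for a simpler configuration.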
Funding: Sponsored by the National Natural Science Foundation of China (Grant No. 61032003) and the Fundamental Research Funds for the Central Universities (Grant No. HIT.NSRIF.2012021).
Abstract: For the AR4JA codes used in deep-space communication, two iterative decoding schemes, partly parallel decoding and overlapped partly parallel decoding, are analyzed, and their advantages and disadvantages are listed. A modified overlapped partly parallel decoding scheme is proposed that inherits the advantages of both algorithms while overcoming their shortcomings. Simulation results show that the three schemes have the same decoding performance, and that modified overlapped partly parallel decoding improves the iterative convergence rate and the system throughput.
Abstract: Genetic algorithms perform very well on large optimization problems, especially in the domain of error-correcting codes. However, they have a major drawback in time complexity and memory occupation when run on a uniprocessor computer. This paper proposes a parallel decoder for linear block codes that uses parallel genetic algorithms (PGA). Its good performance and time complexity are confirmed by theoretical study and by simulations on BCH(63,30,14) codes over both AWGN and flat Rayleigh fading channels. The simulation results show that the coding gain of the parallel over the single genetic algorithm is about 0.7 dB at BER = 10^-5 with only 4 processors.
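Abstract: A foreign function interface (FFI) is the main way for one programming language to call libraries written in another. To address the large amount of manual coding that FFI techniques require, an automated foreign function interface generation (AFIG) method is proposed. The method uses reverse analysis of source code based on abstract syntax trees to accurately extract, from the library files being wrapped, a unified multi-language representation that describes the function interfaces. Based on this unified representation, code generators for different platforms use a matrix of multi-language conversion rules to generate platform-specific FFI code fully automatically. To address the low efficiency of FFI code generation, a task aggregation strategy based on dependency analysis is designed: by aggregating interdependent tasks into new tasks, it effectively eliminates blocking and deadlock among FFI code generation tasks running in parallel, making the tasks scalable and load-balanced on multi-core systems. Experimental results show that, compared with manual coding, AFIG reduces the development code volume in FFI development by 98.14% and the test code volume by 41.95%; compared with the existing SWIG (Simplified Wrapper and Interface Generator) method, it reduces development cost by 61.27% for the same tasks; and generation efficiency grows linearly as computing resources increase.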
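The task aggregation strategy in this abstract can be illustrated with a small sketch. It assumes that "aggregating interdependent tasks" means merging every set of tasks connected by dependencies into one unit of work, so that no unit ever waits on another; the abstract does not give the actual AFIG algorithm, and the function name `aggregate_tasks` and the union-find approach are assumptions.

```python
def aggregate_tasks(tasks, dependencies):
    """Group tasks so that all tasks linked by a dependency share one group.

    tasks: list of hashable, sortable task names.
    dependencies: iterable of (task, prerequisite) pairs.
    Returns a sorted list of groups; groups are mutually independent.
    """
    parent = {t: t for t in tasks}

    def find(t):
        # Follow parent links to the group representative (path halving).
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in dependencies:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two groups

    groups = {}
    for t in tasks:
        groups.setdefault(find(t), []).append(t)
    return sorted(groups.values())


groups = aggregate_tasks(
    ["a", "b", "c", "d", "e", "f"],
    [("a", "b"), ("b", "c"), ("d", "e")],
)
# groups == [["a", "b", "c"], ["d", "e"], ["f"]]
```

Each resulting group can be handed to a separate worker. Tasks inside a group would still have to run in dependency order, which this sketch does not schedule.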
Funding: Supported by the Basic Science Research Program through the National Research Foundation (2015R1D1A3A01019869), Korea.
Abstract: In the era of modern high-performance computing, GPUs are considered excellent accelerators for general-purpose, data-intensive parallel applications. Many performance-oriented optimization techniques have been proposed to speed up applications on GPUs. However, to keep pace with recent trends in power and energy consumption, power- and energy-aware optimization of GPUs needs to be investigated in detail alongside performance-oriented optimization. In this work, to explore the impact of various optimization strategies on GPU performance, power and energy consumption, we evaluate a well-known application running on different commercial GPU devices under different optimization strategies. In particular, to obtain more general performance and power consumption patterns for GPU-based acceleration, our evaluations cover three NVIDIA GPU generations (the Fermi, Kepler and Maxwell architectures) and various core and memory clock frequencies. We analyze how GPU kernel execution is affected by optimization and which GPU architectural factors most affect performance and power/energy consumption. This paper also categorizes which optimization technique primarily improves which metric (i.e., performance, power or energy efficiency). Furthermore, voltage frequency scaling (VFS) is applied to examine the effect of changing the clock frequency on these metrics. In general, our work shows that effective GPU optimization strategies can improve application performance significantly without increasing power and energy consumption.
Abstract: This paper proposes a real-time, reduced-complexity implementation of a low-density parity-check (LDPC) decoder for satellite communication on an FPGA platform. By adopting a (2048, 4096) irregular quasi-cyclic (QC) LDPC code, the proposed partly parallel decoding structure balances the complexity between the check node unit (CNU) and the variable node unit (VNU) based on the min-sum (MS) algorithm, thereby using fewer Slice resources and achieving superior clock performance. Moreover, because a lookup table (LUT) is used to locate the node messages stored in the time-shared memory unit, the memory is simple to reuse and a large amount of storage is saved. Implementation results on a Xilinx FPGA chip show that, compared with the conventional structure, the proposed scheme achieves at least 28.6% and 8% cost reductions in RAM and Slice resources, respectively. The clock frequency is also increased to 280 MHz without degrading decoding performance or reducing convergence speed.
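For readers unfamiliar with the min-sum algorithm named in this abstract, the check node update it refers to can be sketched as below. This is the textbook (unscaled, unnormalized) min-sum rule in floating point, not the paper's hardware CNU; the function name `min_sum_check_node` is hypothetical.

```python
def min_sum_check_node(llrs):
    """Min-sum check node update.

    llrs: incoming variable-to-check messages (log-likelihood ratios).
    Returns one outgoing message per edge: the product of the signs of all
    *other* incoming messages times the smallest magnitude among them.
    """
    out = []
    for i in range(len(llrs)):
        others = llrs[:i] + llrs[i + 1:]
        sign = 1
        for v in others:
            if v < 0:
                sign = -sign
        out.append(sign * min(abs(v) for v in others))
    return out


messages = min_sum_check_node([2.0, -0.5, 1.5])
# messages == [-0.5, 1.5, -0.5]
```

The update needs only comparisons and sign flips, which is why min-sum is a common choice for low-complexity hardware decoders.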
Abstract: Simulation is an important and useful technique that helps users understand and model real-life systems. Once built, the models can be run to provide realistic results, which supports decision-making on a more logical and scientific basis. The paper introduces the simulation method and describes various types of application. The authors analyze the creation and implementation of the program code and compare parallel instruction execution with pipelined instruction execution. The power of simulation is that a common model can be used to design a large variety of systems. An important aspect of the method is that a simulation model is designed to be run repeatedly on actual computer systems, especially multicore processors. For this reason, it is important to minimize the average waiting time of instructions in the fetch and decode stages. The objective of the research is to demonstrate that parallel execution of program code is faster than sequential execution on a multiprocessor architecture. System modeling that uses simulation on parallel computer systems is very precise. The time benefit gained when simulating the mathematical model on the pipelined processor is higher than that gained on the multiprocessor computer system.
Abstract: Wireless communication systems have advanced greatly in recent years. A significant contributor to their performance has been Orthogonal Frequency Division Multiplexing (OFDM), which has been regarded as a technological leap since its invention. Its approach of splitting an information stream across multiple frequency carriers has been adopted by many researchers developing wireless systems. Moreover, because OFDM tolerates channel fading and noise well, so that only a small part of the information stream is lost to noise, gains in speed and reliability followed. OFDM, together with the knowledge that Turbo codes are another excellent scheme for reducing BER, prompted us to extend our research. We therefore experimented, at the simulation level, not only with combining OFDM and Turbo codes but also with finding a Turbo scheme better than a typical PCCC, an SCCC, and a convolutional encoder with a Viterbi decoder. As that last goal has already been accomplished, this paper presents the new OFDM system built on our Turbo scheme. The analysis of this system takes into account the effects of an AWGN channel. The noise analysis was conducted on a simulation platform with specific attributes, such as transmitting and receiving a fixed number of subcarriers (2048 carriers after the IFFT block), while using different types of concatenated convolutional codes: PCCC (parallel), SCCC (serial), and the new PCCC scheme. The results clearly show not only the improvement in BER performance of the Turbo-coded OFDM systems (compared with those using Viterbi decoders) but also the overall superiority of the proposed design.
Abstract: The Peak-to-Average Power Ratio (PAPR) is defined as the ratio of the peak (maximum instantaneous) power to the average power of a signal. PAPR is considered a major problem in OFDM systems because it can cause severe, unexpected fluctuation of the signal. This fluctuation spans a large number of power states, and the sheer number of these states adds complexity to the ADCs and DACs. This research addresses the problem in OFDM systems that use Turbo codes. The μLaCP technique is employed to decrease PAPR. Our OFDM system was simulated over an AWGN channel (without ADCs and DACs) with four types of codes: PCCC (typical and new), SCCC, and convolutional codes. Our Turbo-coded OFDM system exhibited unchanged BER performance before and after applying the μLaCP technique. This was accomplished by modifying our previous PAPR reduction technique without greatly sacrificing its attributes.
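The PAPR definition in this abstract translates directly into a short calculation. The sketch below only computes PAPR in decibels for a block of complex baseband samples; it does not implement the μLaCP reduction technique, whose details are not given here, and the function name `papr_db` is hypothetical.

```python
import math


def papr_db(samples):
    """Peak-to-average power ratio of a block of complex samples, in dB."""
    powers = [abs(s) ** 2 for s in samples]   # instantaneous power per sample
    peak = max(powers)
    average = sum(powers) / len(powers)
    return 10 * math.log10(peak / average)


# A constant-envelope block has PAPR = 0 dB.
flat = papr_db([1, 1j, -1, -1j])

# A block with all of its energy in one sample has a high PAPR.
peaky = papr_db([2, 0, 0, 0])   # peak/average = 4, i.e. about 6.02 dB
```

OFDM symbols sum many independently modulated subcarriers, so occasional samples add up coherently and behave like the second case, which is the source of the problem described above.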
Abstract: This paper presents a high-speed, low-power Viterbi decoder architecture based on deep pipelining, clock gating and toggle filtering. The Add-Compare-Select (ACS) and Trace Back (TB) units of the decoder and their sub-circuits operate in a deeply pipelined manner to achieve a high transmission rate. Power dissipation is also analyzed and compared with existing results. The techniques employed in our low-power design are clock gating and toggle filtering. The synthesized circuits are placed and routed in a standard-cell design environment and implemented on a Xilinx XC2VP2fg256-6 FPGA device. Power estimation obtained through gate-level simulation indicates that the proposed design reduces the power dissipation of an original Viterbi decoder design by 68.82% and achieves a speed of 145 MHz.
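The Add-Compare-Select operation mentioned in this abstract is the core recursion of any Viterbi decoder. The following is a minimal software sketch of one ACS step for a generic trellis, not the paper's pipelined hardware design; the function name `acs_step` and the data layout are assumptions.

```python
def acs_step(path_metrics, branches):
    """One Add-Compare-Select step of a Viterbi decoder.

    path_metrics: current metric of each state (lower is better).
    branches[s]: list of (previous_state, branch_metric) pairs entering state s.
    Returns (new_metrics, survivors), where survivors[s] is the chosen
    predecessor of state s, to be used later by the trace-back unit.
    """
    new_metrics, survivors = [], []
    for candidates in branches:
        # Add: extend each incoming path by its branch metric.
        extended = [(prev, path_metrics[prev] + bm) for prev, bm in candidates]
        # Compare and select: keep the path with the smallest metric.
        best_prev, best_metric = min(extended, key=lambda pair: pair[1])
        new_metrics.append(best_metric)
        survivors.append(best_prev)
    return new_metrics, survivors


# Two-state toy trellis: each state can be reached from state 0 or state 1.
metrics, survivors = acs_step(
    [0, 3],
    [[(0, 2), (1, 0)],    # into state 0: 0+2 = 2 vs 3+0 = 3
     [(0, 4), (1, 0)]],   # into state 1: 0+4 = 4 vs 3+0 = 3
)
# metrics == [2, 3], survivors == [0, 1]
```

Each state's update depends only on the previous step's metrics, so all states can be processed at once, which is what makes the ACS unit a natural target for the parallel and pipelined hardware the abstract describes.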