Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at t...Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at this problem,a parallelization approach was proposed with six memory optimization schemes for CG,four schemes of them aiming at all kinds of sparse matrix-vector multiplication (SPMV) operation. Conducted on IBM QS20,the parallelization approach can reach up to 21 and 133 times speedups with size A and B,respectively,compared with single power processor element. Finally,the conclusion is drawn that the peak bandwidth of memory access on Cell BE can be obtained in SPMV,simple computation is more efficient on heterogeneous processors and loop-unrolling can hide local storage access latency while executing scalar operation on SIMD cores.展开更多
The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Co...The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Consequently,the single-core processor cannot meet the requirements of LTE system.This paper analyzes how to use multi-core processors to achieve parallel processing of uplink demodulation and decoding in LTE systems and designs an approach to parallel processing.The test results prove that this approach works quite well.展开更多
Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to acc...Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to access the shared cache simultaneously.The main problem in improving memory performance is the shared cache architecture and cache replacement.This paper documents the implementation of a Dual-Port Content Addressable Memory(DPCAM)and a modified Near-Far Access Replacement Algorithm(NFRA),which was previously proposed as a shared L2 cache layer in a multi-core processor.Standard Performance Evaluation Corporation(SPEC)Central Processing Unit(CPU)2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer.Results show improved performance of the multicore processor’s DPCAM and NFRA algorithms,corresponding to a higher number of concurrent accesses to shared memory.The new architecture significantly increases system throughput and records performance improvements of up to 8.7%on various types of SPEC 2006 benchmarks.The miss rate is also improved by about 13%,with some exceptions in the sphinx3 and bzip2 benchmarks.These results could open a new window for solving the long-standing problems with shared cache in multi-core processors.展开更多
Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This pa...Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from various aspects including system scalability, organization of memory hierarchy, network-on-chip, inter-chip connection and I/O subsystem.展开更多
This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention in accelerating computer ...This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention in accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, DEM algorithm is shown to have high scalabilitv in a multi-thread parallel computation on a CPU.展开更多
Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous down scaling of CMOS technology, within-die variations in the manufacturin...Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous down scaling of CMOS technology, within-die variations in the manufacturing process lead to a significant spread in the operating speeds of cores within homogeneous multi-core processors. Task scheduling approaches, which do not consider such heterogeneity caused by within-die variations,can lead to an overly pessimistic result in terms of performance. To realize an optimal performance according to the actual maximum clock frequencies at which cores can run, we present a heterogeneity-aware schedule refining(HASR) scheme by fully exploiting the heterogeneities of homogeneous multi-core processors in embedded domains.We analyze and show how the actual maximum frequencies of cores are used to guide the scheduling. In the scheme,representative chip operating points are selected and the corresponding optimal schedules are generated as candidate schedules. During the booting of each chip, according to the actual maximum clock frequencies of cores, one of the candidate schedules is bound to the chip to maximize the performance. A set of applications are designed to evaluate the proposed scheme. Experimental results show that the proposed scheme can improve the performance by an average value of 22.2%, compared with the baseline schedule based on the worst case timing analysis. Compared with the conventional task scheduling approach based on the actual maximum clock frequencies, the proposed scheme also improves the performance by up to 12%.展开更多
We consider the energy saving problem for caches on a multi-core processor. In the previous research on low power processors, there are various methods to reduce power dissipation. Tag reduction is one of them. This p...We consider the energy saving problem for caches on a multi-core processor. In the previous research on low power processors, there are various methods to reduce power dissipation. Tag reduction is one of them. This paper extends the tag reduction technique on a single-core processor to a multi-core processor and investigates the potential of energy saving for multi-core processors. We formulate our approach as an equivalent problem which is to find an assignment of the whole instruction pages in the physical memory to a set of cores such that the tag-reduction conflicts for each core can be mostly avoided or reduced. We then propose three algorithms using different heuristics for this assignment problem. We provide convincing experimental results by collecting experimental data from a real operating system instead of the traditional way using a processor simulator that cannot simulate operating system functions and the full memory hierarchy. Experimental results show that our proposed algorithms can save total energy up to 83.93% on an 8-core processor and 76.16% on a 4-core processor in average compared to the one that the tag-reduction is not used for. They also significantly outperform the tag reduction based algorithm on a single-core processor.展开更多
Task scheduling is an essential aspect of parallel process system. This NP-hard problem assumes fully connected homogeneous processors and ignores contention on the communication links. However, as arbitrary processor...Task scheduling is an essential aspect of parallel process system. This NP-hard problem assumes fully connected homogeneous processors and ignores contention on the communication links. However, as arbitrary processor network (APN), communication contention has a strong influence on the execution time of a parallel application. This paper investigates the incorporation of contention awareness into task scheduling. The innovation is the idea of dynamically scheduling edges to links, for which we use the earliest finish communication time search algorithm based on shortest-path search method. The other novel idea proposed in this paper is scheduling priority based on recursive rank computation on heterogeneous arbitrary processor network. In the end, to reduce time complexity of algorithm, a parallel algorithm is proposed and speedup O(PPE) is achieved. The comparison study, based on both randomly generated graphs and the graphs of some real applications, shows that our scheduling algorithm significantly surpasses classic and static communication contention awareness algorithm, especially for high data transmission rate parallel application.展开更多
Multi-core digital signal processors (DSPs) are widely used in wireless telecommunication, core network transcoding, industrial control, and audio/video processing technologies, among others. In comparison with gene...Multi-core digital signal processors (DSPs) are widely used in wireless telecommunication, core network transcoding, industrial control, and audio/video processing technologies, among others. In comparison with general-purpose multi-processors, multi-core DSPs normally have a more complex memory hierarchy, such as on-chip core-local memory and non-cache-coherent shared memory. As a result, efficient multi-core DSP applications are very difficult to write. The current approach used to program multi-core DSPs is based on proprietary vendor software development kits (SDKs), which only provide low-level, non-portable primitives. While it is acceptable to write coarse-grained task-level parallel code with these SDKs, writing fine-grained data parallel code with SDKs is a very tedious and error-prone approach. We believe that it is desirable to possess a high-level and portable parallel programming model for multi-core DSPs. In this paper, we propose OpenMDSP, an extension of OpenMP designed for multi-core DSPs. The goal of OpenMDSP is to fill the gap between the OpenMP memory model and the memory hierarchy of multi-core DSPs. We propose three classes of directives in OpenMDSP, including 1) data placement directives that allow programmers to control the placement of global variables conveniently, 2) distributed array directives that divide a whole array into sections and promote the sections into core-local memory to improve performance, and 3) stream access directives that promote big arrays into core-local memory section by section during parallel loop processing while hiding the latency of data movement by the direct memory access (DMA) of a DSP. We implement the compiler and runtime system for OpenMDSP on PreeScale MSC8156. The benchmarking results show that seven of nine benchmarks achieve a speedup of more than a factor of 5 when using six threads.展开更多
Multi-core architectures are widely used to in time-to-market and power consumption of the chips enhance the microprocessor performance within a limited increase Toward the application of high-density data signal pro...Multi-core architectures are widely used to in time-to-market and power consumption of the chips enhance the microprocessor performance within a limited increase Toward the application of high-density data signal processing, this paper presents a novel heterogeneous multi-core architecture digital signal processor (DSP), YHFT-QDSP, with one RISC CPU core and 4 VLIW DSP cores. By three kinds of interconnection, YHFT-QDSP provides high efficiency message communication for inner-chip RISC core and DSP cores, inner-chip and inter-chip DSP cores. A parallel programming platform is specifically developed for the heterogeneous nmlti-core architecture of YHFT-QDSP. This parallel programming environment provides a parallel support library and a friendly interface between high level application softwares and multi- core DSP. The 130 nm CMOS custom chip design results benchmarks show that the interconnection structure of in a high speed and moderate power design. The results of typical YHFT-QDSP is much better than other related structures and achieves better speedup when using the interconnection facilities in combing methods. YHFT-QDSP has been signed off and manufactured presently. The future applications of the multi-core chip could be found in 3G wireless base station, high performance radar, industrial applications, and so on.展开更多
Helper-thread of a task can hide the memory access time of irregular data on the chip muhi-core processor (CMP). For constructing a compiler that effectively supports the helper-thread of a task in the multi-core sc...Helper-thread of a task can hide the memory access time of irregular data on the chip muhi-core processor (CMP). For constructing a compiler that effectively supports the helper-thread of a task in the multi-core scenario based on the last level shared cache, this paper studies its performance stable condi- tions. Unfortunately, there is no existing model that allows extensive investigation of the impact of stable conditions, we present the base of pre-computation that is formalized by our degraded task-pair 〈 T, T' 〉 with the helper-thread, and its stable conditions are analyzed. Finally, a novel performance model and a constructing method of pre-computation based on our positive degraded task-pair are proposed. The efficient results are shown by our experiments. If we further exploit memory level parallelism (MLP) for our task-pair, the task-pair 〈 T, T' 〉 can reach better performance.展开更多
Established on the Intel Multi-Core Embedded platform, using 802.11 Wireless Network protocols as the communication medium, combining with Radio Frequency-Communication and Ultrasonic Ranging, imple-ment a mobile term...Established on the Intel Multi-Core Embedded platform, using 802.11 Wireless Network protocols as the communication medium, combining with Radio Frequency-Communication and Ultrasonic Ranging, imple-ment a mobile terminal system in an intellectualized building. It can provide its holder such functions: 1) Accurate Positioning 2) Intelligent Navigation 3) Video Monitoring 4) Wireless Communication. The inno-vative point for this paper is to apply the multi-core computing on the embedded system to promote its com-puting speed and give a real-time performance and apply this system into the indoor environment for the purpose of emergent event or rescuing.展开更多
This paper proposes the design and development of a novel, portable and low-cost intelligent electronic device (IED) for real-time monitoring of high frequency phenomena in CENELEC PLC band. A high speed floating-poin...This paper proposes the design and development of a novel, portable and low-cost intelligent electronic device (IED) for real-time monitoring of high frequency phenomena in CENELEC PLC band. A high speed floating-point digital signal processor (DSP) along with 4 MSPS analog-to-digital converter (ADC) is used to develop the intelligent electronic device. An optimized algorithm to process the analog signal in real-time and to extract the meaningful result using signal processing techniques has been implemented on the device. A laboratory environment has setup with all the necessary equipment including the development of the load model to evaluate the performance of the IED. Smart meter and concentrator is also connected to the low voltage (LV) network to monitor the PLC communication using the IED. The device has been tested in the laboratory and it has produced very promising results for time domain as well as frequency domain analysis. Those results imply that the IED is fully capable of monitoring high frequency disturbances in CENELEC PLC band.展开更多
Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which h...Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing ele- ments (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and share-memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, efficient register-level communication mechanism, and fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.展开更多
An adaptive polarization mode dispersion (PMD) compensation experiment is reported in a 40-Gb/s phase shaped binary transmission (PSBT) communication system, with the use of a new digital signal processor (DSP)-...An adaptive polarization mode dispersion (PMD) compensation experiment is reported in a 40-Gb/s phase shaped binary transmission (PSBT) communication system, with the use of a new digital signal processor (DSP)-based optical PMD compensator. PMD tolerance is found to be enhanced by 8 ps after PMD compensation with 1-dB optical signal-to-noise ratio (OSNR) penalty. Under the condition of fast change of states of polarization up to 85 rad/s in the fiber link, the performance of our PMD compensator undergoes the bit error ratio (BER) test for as long as 10 h.展开更多
High precision and stable clock is extremely important in communication and navigation.The miniaturization of the clocks is considered to be the trend to satisfy the demand for5G and the next generation communications...High precision and stable clock is extremely important in communication and navigation.The miniaturization of the clocks is considered to be the trend to satisfy the demand for5G and the next generation communications.Based on the concept of meter bar and the principle of the constancy of light velocity,we designed a micro clock,Space Time Clock(STC),with the size smaller than 1 mm×1 mm and the power dissipation less than 2 m W.Designed in integrated circuit of 0.18μm technology,the instability of STC is assessed to be 2.23×10^(-12)and the trend of the instability is reversely proportional toτ.With the potential ability to reach the level of 10instability on chip in the future,the period of the STC’s signal is locked on the delay time defined by the meter bar which keeps the time reference constant.Because of its superior performance,the STC is more suitable for mobile communication,PNT(Positioning,Navigation and Timing),embedded processor and deep space application,and becomes the main payload of the ASRTU satellite scheduled to launch next year and investigate in space environment.展开更多
基金Project(2008AA01A201) supported the National High-tech Research and Development Program of ChinaProjects(60833004, 60633050) supported by the National Natural Science Foundation of China
文摘Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at this problem,a parallelization approach was proposed with six memory optimization schemes for CG,four schemes of them aiming at all kinds of sparse matrix-vector multiplication (SPMV) operation. Conducted on IBM QS20,the parallelization approach can reach up to 21 and 133 times speedups with size A and B,respectively,compared with single power processor element. Finally,the conclusion is drawn that the peak bandwidth of memory access on Cell BE can be obtained in SPMV,simple computation is more efficient on heterogeneous processors and loop-unrolling can hide local storage access latency while executing scalar operation on SIMD cores.
文摘The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Consequently,the single-core processor cannot meet the requirements of LTE system.This paper analyzes how to use multi-core processors to achieve parallel processing of uplink demodulation and decoding in LTE systems and designs an approach to parallel processing.The test results prove that this approach works quite well.
文摘Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to access the shared cache simultaneously.The main problem in improving memory performance is the shared cache architecture and cache replacement.This paper documents the implementation of a Dual-Port Content Addressable Memory(DPCAM)and a modified Near-Far Access Replacement Algorithm(NFRA),which was previously proposed as a shared L2 cache layer in a multi-core processor.Standard Performance Evaluation Corporation(SPEC)Central Processing Unit(CPU)2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer.Results show improved performance of the multicore processor’s DPCAM and NFRA algorithms,corresponding to a higher number of concurrent accesses to shared memory.The new architecture significantly increases system throughput and records performance improvements of up to 8.7%on various types of SPEC 2006 benchmarks.The miss rate is also improved by about 13%,with some exceptions in the sphinx3 and bzip2 benchmarks.These results could open a new window for solving the long-standing problems with shared cache in multi-core processors.
基金Supported by the National High Technology Development 863 Program of China under Grant No.2008AA010901the National Natural Science Foundation of China under Grant Nos.60736012 and 60673146the National Basic Research 973 Program of China under Grant No.2005CB321601.
文摘Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from various aspects including system scalability, organization of memory hierarchy, network-on-chip, inter-chip connection and I/O subsystem.
文摘This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention in accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, DEM algorithm is shown to have high scalabilitv in a multi-thread parallel computation on a CPU.
基金Project supported by the National Natural Science Foundation of China(Nos.6122500861373074+3 种基金and 61373090)the National Basic Research Program(973)of China(No.2014CB349304)the Specialized Research Fund for the Doctoral Program of Higher Education,the Ministry of Education of China(No.20120002110033)the Tsinghua University Initiative Scientific Research Program
文摘Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous down scaling of CMOS technology, within-die variations in the manufacturing process lead to a significant spread in the operating speeds of cores within homogeneous multi-core processors. Task scheduling approaches, which do not consider such heterogeneity caused by within-die variations,can lead to an overly pessimistic result in terms of performance. To realize an optimal performance according to the actual maximum clock frequencies at which cores can run, we present a heterogeneity-aware schedule refining(HASR) scheme by fully exploiting the heterogeneities of homogeneous multi-core processors in embedded domains.We analyze and show how the actual maximum frequencies of cores are used to guide the scheduling. In the scheme,representative chip operating points are selected and the corresponding optimal schedules are generated as candidate schedules. During the booting of each chip, according to the actual maximum clock frequencies of cores, one of the candidate schedules is bound to the chip to maximize the performance. A set of applications are designed to evaluate the proposed scheme. Experimental results show that the proposed scheme can improve the performance by an average value of 22.2%, compared with the baseline schedule based on the worst case timing analysis. Compared with the conventional task scheduling approach based on the actual maximum clock frequencies, the proposed scheme also improves the performance by up to 12%.
基金supported by the National Basic Research 973 Program of China under Grant No. 2007CB310900the National Natural Science Foundation of China under Grant No. 60725208Fellowships of the Japan Society for the Promotion of Sciencefor Young Scientists Program
文摘We consider the energy saving problem for caches on a multi-core processor. In the previous research on low power processors, there are various methods to reduce power dissipation. Tag reduction is one of them. This paper extends the tag reduction technique on a single-core processor to a multi-core processor and investigates the potential of energy saving for multi-core processors. We formulate our approach as an equivalent problem which is to find an assignment of the whole instruction pages in the physical memory to a set of cores such that the tag-reduction conflicts for each core can be mostly avoided or reduced. We then propose three algorithms using different heuristics for this assignment problem. We provide convincing experimental results by collecting experimental data from a real operating system instead of the traditional way using a processor simulator that cannot simulate operating system functions and the full memory hierarchy. Experimental results show that our proposed algorithms can save total energy up to 83.93% on an 8-core processor and 76.16% on a 4-core processor in average compared to the one that the tag-reduction is not used for. They also significantly outperform the tag reduction based algorithm on a single-core processor.
基金Supported by the National Natural Science Foundation of China (Grant Nos. 90715029 and 60603053)the Cultivation Fund of the Key Scientific and Technical Innovation Project, Ministry of Edacation of Chinathe Key Project of Science & Technology of Hunan Province(Grant No. 2006GK2006)
文摘Task scheduling is an essential aspect of parallel process system. This NP-hard problem assumes fully connected homogeneous processors and ignores contention on the communication links. However, as arbitrary processor network (APN), communication contention has a strong influence on the execution time of a parallel application. This paper investigates the incorporation of contention awareness into task scheduling. The innovation is the idea of dynamically scheduling edges to links, for which we use the earliest finish communication time search algorithm based on shortest-path search method. The other novel idea proposed in this paper is scheduling priority based on recursive rank computation on heterogeneous arbitrary processor network. In the end, to reduce time complexity of algorithm, a parallel algorithm is proposed and speedup O(PPE) is achieved. The comparison study, based on both randomly generated graphs and the graphs of some real applications, shows that our scheduling algorithm significantly surpasses classic and static communication contention awareness algorithm, especially for high data transmission rate parallel application.
基金supported by the National High Technology Research and Development 863 Program of China under Grant No.2012AA010901the National Natural Science Foundation of China under Grant No.61103021
文摘Multi-core digital signal processors (DSPs) are widely used in wireless telecommunication, core network transcoding, industrial control, and audio/video processing technologies, among others. In comparison with general-purpose multi-processors, multi-core DSPs normally have a more complex memory hierarchy, such as on-chip core-local memory and non-cache-coherent shared memory. As a result, efficient multi-core DSP applications are very difficult to write. The current approach used to program multi-core DSPs is based on proprietary vendor software development kits (SDKs), which only provide low-level, non-portable primitives. While it is acceptable to write coarse-grained task-level parallel code with these SDKs, writing fine-grained data parallel code with SDKs is a very tedious and error-prone approach. We believe that it is desirable to possess a high-level and portable parallel programming model for multi-core DSPs. In this paper, we propose OpenMDSP, an extension of OpenMP designed for multi-core DSPs. The goal of OpenMDSP is to fill the gap between the OpenMP memory model and the memory hierarchy of multi-core DSPs. We propose three classes of directives in OpenMDSP, including 1) data placement directives that allow programmers to control the placement of global variables conveniently, 2) distributed array directives that divide a whole array into sections and promote the sections into core-local memory to improve performance, and 3) stream access directives that promote big arrays into core-local memory section by section during parallel loop processing while hiding the latency of data movement by the direct memory access (DMA) of a DSP. We implement the compiler and runtime system for OpenMDSP on PreeScale MSC8156. The benchmarking results show that seven of nine benchmarks achieve a speedup of more than a factor of 5 when using six threads.
基金supported by the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grant No.2009ZX01034-001-001-006the National High Technology Research and Development 863 Program of China under Grant No.2007AA01Z108the Program for Changjiang Scholars and Innovative Research Team in Universities of China under Grant No.IRT0614.
文摘Multi-core architectures are widely used to in time-to-market and power consumption of the chips enhance the microprocessor performance within a limited increase Toward the application of high-density data signal processing, this paper presents a novel heterogeneous multi-core architecture digital signal processor (DSP), YHFT-QDSP, with one RISC CPU core and 4 VLIW DSP cores. By three kinds of interconnection, YHFT-QDSP provides high efficiency message communication for inner-chip RISC core and DSP cores, inner-chip and inter-chip DSP cores. A parallel programming platform is specifically developed for the heterogeneous nmlti-core architecture of YHFT-QDSP. This parallel programming environment provides a parallel support library and a friendly interface between high level application softwares and multi- core DSP. The 130 nm CMOS custom chip design results benchmarks show that the interconnection structure of in a high speed and moderate power design. The results of typical YHFT-QDSP is much better than other related structures and achieves better speedup when using the interconnection facilities in combing methods. YHFT-QDSP has been signed off and manufactured presently. The future applications of the multi-core chip could be found in 3G wireless base station, high performance radar, industrial applications, and so on.
文摘Helper-thread of a task can hide the memory access time of irregular data on the chip muhi-core processor (CMP). For constructing a compiler that effectively supports the helper-thread of a task in the multi-core scenario based on the last level shared cache, this paper studies its performance stable condi- tions. Unfortunately, there is no existing model that allows extensive investigation of the impact of stable conditions, we present the base of pre-computation that is formalized by our degraded task-pair 〈 T, T' 〉 with the helper-thread, and its stable conditions are analyzed. Finally, a novel performance model and a constructing method of pre-computation based on our positive degraded task-pair are proposed. The efficient results are shown by our experiments. If we further exploit memory level parallelism (MLP) for our task-pair, the task-pair 〈 T, T' 〉 can reach better performance.
文摘Established on the Intel Multi-Core Embedded platform, using 802.11 Wireless Network protocols as the communication medium, combining with Radio Frequency-Communication and Ultrasonic Ranging, imple-ment a mobile terminal system in an intellectualized building. It can provide its holder such functions: 1) Accurate Positioning 2) Intelligent Navigation 3) Video Monitoring 4) Wireless Communication. The inno-vative point for this paper is to apply the multi-core computing on the embedded system to promote its com-puting speed and give a real-time performance and apply this system into the indoor environment for the purpose of emergent event or rescuing.
文摘This paper proposes the design and development of a novel, portable and low-cost intelligent electronic device (IED) for real-time monitoring of high frequency phenomena in CENELEC PLC band. A high speed floating-point digital signal processor (DSP) along with 4 MSPS analog-to-digital converter (ADC) is used to develop the intelligent electronic device. An optimized algorithm to process the analog signal in real-time and to extract the meaningful result using signal processing techniques has been implemented on the device. A laboratory environment has setup with all the necessary equipment including the development of the load model to evaluate the performance of the IED. Smart meter and concentrator is also connected to the low voltage (LV) network to monitor the PLC communication using the IED. The device has been tested in the laboratory and it has produced very promising results for time domain as well as frequency domain analysis. Those results imply that the IED is fully capable of monitoring high frequency disturbances in CENELEC PLC band.
文摘Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing ele- ments (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and share-memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, efficient register-level communication mechanism, and fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.
基金supported by the National "863" Projectof China (No. 2009AA01Z224)Huawei Technology Project (No. YBON2008014)
文摘An adaptive polarization mode dispersion (PMD) compensation experiment is reported in a 40-Gb/s phase shaped binary transmission (PSBT) communication system, with the use of a new digital signal processor (DSP)-based optical PMD compensator. PMD tolerance is found to be enhanced by 8 ps after PMD compensation with 1-dB optical signal-to-noise ratio (OSNR) penalty. Under the condition of fast change of states of polarization up to 85 rad/s in the fiber link, the performance of our PMD compensator undergoes the bit error ratio (BER) test for as long as 10 h.
基金National Natural Science Foundation of China(No.11973021)Harbin Institute of Technology,Research Centre of Satellite Technology and Department of Microelectronics Science and Technologysupported by the ASRTU satellite project。
文摘High precision and stable clock is extremely important in communication and navigation.The miniaturization of the clocks is considered to be the trend to satisfy the demand for5G and the next generation communications.Based on the concept of meter bar and the principle of the constancy of light velocity,we designed a micro clock,Space Time Clock(STC),with the size smaller than 1 mm×1 mm and the power dissipation less than 2 m W.Designed in integrated circuit of 0.18μm technology,the instability of STC is assessed to be 2.23×10^(-12)and the trend of the instability is reversely proportional toτ.With the potential ability to reach the level of 10instability on chip in the future,the period of the STC’s signal is locked on the delay time defined by the meter bar which keeps the time reference constant.Because of its superior performance,the STC is more suitable for mobile communication,PNT(Positioning,Navigation and Timing),embedded processor and deep space application,and becomes the main payload of the ASRTU satellite scheduled to launch next year and investigate in space environment.