期刊文献+
共找到16篇文章
< 1 >
每页显示 20 50 100
Multi-core optimization for conjugate gradient benchmark on heterogeneous processors
1
作者 邓林 窦勇 《Journal of Central South University》 SCIE EI CAS 2011年第2期490-498,共9页
Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at t... Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at this problem,a parallelization approach was proposed with six memory optimization schemes for CG,four schemes of them aiming at all kinds of sparse matrix-vector multiplication (SPMV) operation. Conducted on IBM QS20,the parallelization approach can reach up to 21 and 133 times speedups with size A and B,respectively,compared with single power processor element. Finally,the conclusion is drawn that the peak bandwidth of memory access on Cell BE can be obtained in SPMV,simple computation is more efficient on heterogeneous processors and loop-unrolling can hide local storage access latency while executing scalar operation on SIMD cores. 展开更多
关键词 multi-core processor NAS parallelization CG memory optimization
下载PDF
Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on Multi-Core Processor
2
作者 Zhang Ziran,Li Jun,Li Changxiao(ZTE Corporation,Shenzhen 518057,P.R.China) 《ZTE Communications》 2009年第1期54-58,共5页
The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Co... The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Consequently,the single-core processor cannot meet the requirements of LTE system.This paper analyzes how to use multi-core processors to achieve parallel processing of uplink demodulation and decoding in LTE systems and designs an approach to parallel processing.The test results prove that this approach works quite well. 展开更多
关键词 CORE LTE Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on multi-core processor Design
下载PDF
Shared Cache Based on Content Addressable Memory in a Multi-Core Architecture
3
作者 Allam Abumwais Mahmoud Obaid 《Computers, Materials & Continua》 SCIE EI 2023年第3期4951-4963,共13页
Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to acc... Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to access the shared cache simultaneously.The main problem in improving memory performance is the shared cache architecture and cache replacement.This paper documents the implementation of a Dual-Port Content Addressable Memory(DPCAM)and a modified Near-Far Access Replacement Algorithm(NFRA),which was previously proposed as a shared L2 cache layer in a multi-core processor.Standard Performance Evaluation Corporation(SPEC)Central Processing Unit(CPU)2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer.Results show improved performance of the multicore processor’s DPCAM and NFRA algorithms,corresponding to a higher number of concurrent accesses to shared memory.The new architecture significantly increases system throughput and records performance improvements of up to 8.7%on various types of SPEC 2006 benchmarks.The miss rate is also improved by about 13%,with some exceptions in the sphinx3 and bzip2 benchmarks.These results could open a new window for solving the long-standing problems with shared cache in multi-core processors. 展开更多
关键词 multi-core processor shared cache content addressable memory dual port CAM replacement algorithm benchmark program
下载PDF
System Architecture of Godson-3 Multi-Core Processors 被引量:7
4
作者 高翔 陈云霁 +2 位作者 王焕东 唐丹 胡伟武 《Journal of Computer Science & Technology》 SCIE EI CSCD 2010年第2期181-191,共11页
Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This pa... Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from various aspects including system scalability, organization of memory hierarchy, network-on-chip, inter-chip connection and I/O subsystem. 展开更多
关键词 multi-core processor scalable interconnection cache coherent non-uniform memory access/non-uniform cache access (CC-NUMA/NUCA) MESH CROSSBAR cache coherence reliability availability and serviceability (RAS)
原文传递
Parallel computing of discrete element method on multi-core processors 被引量:6
5
作者 Yusuke Shigeto Mikio Sakai 《Particuology》 SCIE EI CAS CSCD 2011年第4期398-405,共8页
This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention in accelerating computer ... This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention in accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, DEM algorithm is shown to have high scalabilitv in a multi-thread parallel computation on a CPU. 展开更多
关键词 Discrete element method Parallel computing multi-core processor GPGPU
原文传递
Energy Efficiency of a Multi-Core Processor by Tag Reduction
6
作者 郑龙 董冕雄 +3 位作者 Kaoru Ota 金海 Song Guo 马俊 《Journal of Computer Science & Technology》 SCIE EI CSCD 2011年第3期491-503,共13页
We consider the energy saving problem for caches on a multi-core processor. In the previous research on low power processors, there are various methods to reduce power dissipation. Tag reduction is one of them. This p... We consider the energy saving problem for caches on a multi-core processor. In the previous research on low power processors, there are various methods to reduce power dissipation. Tag reduction is one of them. This paper extends the tag reduction technique on a single-core processor to a multi-core processor and investigates the potential of energy saving for multi-core processors. We formulate our approach as an equivalent problem which is to find an assignment of the whole instruction pages in the physical memory to a set of cores such that the tag-reduction conflicts for each core can be mostly avoided or reduced. We then propose three algorithms using different heuristics for this assignment problem. We provide convincing experimental results by collecting experimental data from a real operating system instead of the traditional way using a processor simulator that cannot simulate operating system functions and the full memory hierarchy. Experimental results show that our proposed algorithms can save total energy up to 83.93% on an 8-core processor and 76.16% on a 4-core processor in average compared to the one that the tag-reduction is not used for. They also significantly outperform the tag reduction based algorithm on a single-core processor. 展开更多
关键词 tag reduction multi-core processor energy efficiency
原文传递
Schedule refinement for homogeneous multi-core processors in the presence of manufacturing-caused heterogeneity
7
作者 Zhi-xiang CHEN Zhao-lin LI +2 位作者 Shan CAO Fang WANG Jie ZHOU 《Frontiers of Information Technology & Electronic Engineering》 SCIE EI CSCD 2015年第12期1018-1033,共16页
Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous down scaling of CMOS technology, within-die variations in the manufacturin... Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous down scaling of CMOS technology, within-die variations in the manufacturing process lead to a significant spread in the operating speeds of cores within homogeneous multi-core processors. Task scheduling approaches, which do not consider such heterogeneity caused by within-die variations,can lead to an overly pessimistic result in terms of performance. To realize an optimal performance according to the actual maximum clock frequencies at which cores can run, we present a heterogeneity-aware schedule refining(HASR) scheme by fully exploiting the heterogeneities of homogeneous multi-core processors in embedded domains.We analyze and show how the actual maximum frequencies of cores are used to guide the scheduling. In the scheme,representative chip operating points are selected and the corresponding optimal schedules are generated as candidate schedules. During the booting of each chip, according to the actual maximum clock frequencies of cores, one of the candidate schedules is bound to the chip to maximize the performance. A set of applications are designed to evaluate the proposed scheme. Experimental results show that the proposed scheme can improve the performance by an average value of 22.2%, compared with the baseline schedule based on the worst case timing analysis. Compared with the conventional task scheduling approach based on the actual maximum clock frequencies, the proposed scheme also improves the performance by up to 12%. 展开更多
关键词 Schedule refining multi-core processor HETEROGENEITY Representative chip operating point
原文传递
Thread Private Variable Access Optimization Technique for Sunway High-Performance Multi-core Processors
8
作者 Jinying Kong Kai Nie +2 位作者 Qinglei Zhou Jinlong Xu Lin Han 《国际计算机前沿大会会议论文集》 2021年第1期180-189,共10页
The primary way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processor is to use the OpenMP programming technique.To address the problem of low parallelism efficiency caused by slow acce... The primary way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processor is to use the OpenMP programming technique.To address the problem of low parallelism efficiency caused by slow accessto thread private variables in the compilation of Sunway OpenMP programs, thispaper proposes a thread private variable access technique based on privilegedinstructions. The privileged instruction-based thread-private variable access techniquecentralizes the implementation of thread-private variables at the compilerlevel, eliminating the model switching overhead of invoking OS core processingand improving the speed of accessing thread-private variables. On the Sunway1621 server platform, NPB3.3-OMP and SPEC OMP2012 achieved 6.2% and6.8% running efficiency gains, respectively. The results show that the techniquesproposed in this paper can provide technical support for giving full play to theadvantages of Sunway’s high-performance multi-core processors. 展开更多
关键词 Sunway high-performance multi-core processors OpenMP programming technique Privileged instruction-based thread-private variable access technique Sunway 1621 processor
原文传递
Parallel Region Reconstruction Technique for Sunway High-Performance Multi-core Processors
9
作者 Kai Nie Qinglei Zhou +3 位作者 Hong Qian Jianmin Pang Jinlong Xu Yapeng Li 《国际计算机前沿大会会议论文集》 2021年第1期163-179,共17页
The leading way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processors is to use OpenMP programming techniques.In order to address the problem of low parallel efficiency caused by hight... The leading way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processors is to use OpenMP programming techniques.In order to address the problem of low parallel efficiency caused by highthread group control overhead in the compilation of Sunway OpenMP programs,this paper proposes the parallel region reconstruction technique. The parallelregion reconstruction technique expands the parallel scope of parallel regionsin OpenMP programs by parallel region merging and parallel region extending.Moreover, it reduces the number of parallel regions in OpenMP programs,decreases the overhead of frequent creation and convergence of thread groups,and converts standard fork-join model OpenMP programs to higher performanceSPMD modelOpenMP programs. On the Sunway 1621 server computer, NPB3.3-OMP and SPEC OMP2012 achieved 8.9% and 7.9% running efficiency improvementrespectively through parallel region reconstruction technique. As a result,the parallel region reconstruction technique is feasible and effective. It providestechnical support to fully exploit the multi-core parallelism advantage of Sunway’shigh-performance processors. 展开更多
关键词 Sunway high-performance multi-core processors OpenMP programming technique Parallel domain reconstruction technique
原文传递
基于多核处理器的IPSec VPN系统安全策略检索研究 被引量:6
10
作者 陈亮 王建 赵勇 《计算机工程与应用》 CSCD 北大核心 2017年第23期67-71,共5页
为满足现代高带宽互联网应用环境中的安全性保障要求,提出了一种基于Tilera GX36多核网络处理平台的IPSec VPN系统结构,利用SDN思想设计了系统控制面和数据面的程序功能模块,实现网络流量安全的灵活控制。针对系统中安全策略检索性能要... 为满足现代高带宽互联网应用环境中的安全性保障要求,提出了一种基于Tilera GX36多核网络处理平台的IPSec VPN系统结构,利用SDN思想设计了系统控制面和数据面的程序功能模块,实现网络流量安全的灵活控制。针对系统中安全策略检索性能要求,提出一种基于Hash的三级安全策略流表存储结构,并以各Tile CPU缓存中的安全关联流表为快速检索数据源设计一种安全策略检索方法。测试结果表明,对于互联网中典型的小包、中包和大包应用场景,该系统均能达到近40 Gb/s的处理性能。 展开更多
关键词 多核网络处理器 tilera GX36 IPSEC VPN 安全策略检索
下载PDF
基于多核处理器的高清实时MPEG-2——H.264转码器设计 被引量:2
11
作者 叶朝敏 陈颖琪 高志勇 《电视技术》 北大核心 2012年第21期15-19,共5页
基于多核处理器的并行计算为实时实现高清MPEG-2至H.264转码器提供了1种可行的实现方案。提出了1种多颗粒度的MPEG-2—H.264全解全编并行转码器设计方案,其中MPEG-2解码器采用了帧内与数据级两级并行,H.264编码器采用了帧间、帧内与数... 基于多核处理器的并行计算为实时实现高清MPEG-2至H.264转码器提供了1种可行的实现方案。提出了1种多颗粒度的MPEG-2—H.264全解全编并行转码器设计方案,其中MPEG-2解码器采用了帧内与数据级两级并行,H.264编码器采用了帧间、帧内与数据级三级并行。实验结果表明本设计不仅具有良好的并行加速比,而且可以在使用1/4的TilePro64处理器核资源的条件下完成1路实时高清转码。 展开更多
关键词 MPEG-H.264转码器 tilera多核处理器 多粒度并行计算
下载PDF
高清电子内窥镜系统低时延视频处理设计 被引量:3
12
作者 李益庆 陈耀武 《计算机工程》 CAS CSCD 2014年第12期258-261,266,共5页
针对高清电子内窥镜视频处理系统的高实时性要求,提出一种基于TILE-Gx36多核处理器的高清电子内窥镜系统低时延视频处理设计方案。将内窥镜的视频处理任务按流水线的处理方式划分为视频采集、视频处理和视频送显,并利用多核处理器的并... 针对高清电子内窥镜视频处理系统的高实时性要求,提出一种基于TILE-Gx36多核处理器的高清电子内窥镜系统低时延视频处理设计方案。将内窥镜的视频处理任务按流水线的处理方式划分为视频采集、视频处理和视频送显,并利用多核处理器的并行处理能力,从视频采集、视频处理2个环节入手,采用减少视频采集等待时间的方法来减少采集延时,通过优化插值算法和实现多任务负载均衡的并行处理来提高视频处理环节的效率。实验结果表明,该方案的实时视频处理延时在19.4ms以内,达到了医用电子内窥镜领域的较高水平。 展开更多
关键词 高清电子内窥镜 低时延 视频处理 tilera多核处理器 负载均衡 双线性插值
下载PDF
OpenMDSP:Extending OpenMP to Program Multi-Core DSPs 被引量:1
13
作者 何江舟 陈文光 +3 位作者 陈光日 郑纬民 汤志忠 叶寒栋 《Journal of Computer Science & Technology》 SCIE EI CSCD 2014年第2期316-331,共16页
Multi-core digital signal processors (DSPs) are widely used in wireless telecommunication, core network transcoding, industrial control, and audio/video processing technologies, among others. In comparison with gene... Multi-core digital signal processors (DSPs) are widely used in wireless telecommunication, core network transcoding, industrial control, and audio/video processing technologies, among others. In comparison with general-purpose multi-processors, multi-core DSPs normally have a more complex memory hierarchy, such as on-chip core-local memory and non-cache-coherent shared memory. As a result, efficient multi-core DSP applications are very difficult to write. The current approach used to program multi-core DSPs is based on proprietary vendor software development kits (SDKs), which only provide low-level, non-portable primitives. While it is acceptable to write coarse-grained task-level parallel code with these SDKs, writing fine-grained data parallel code with SDKs is a very tedious and error-prone approach. We believe that it is desirable to possess a high-level and portable parallel programming model for multi-core DSPs. In this paper, we propose OpenMDSP, an extension of OpenMP designed for multi-core DSPs. The goal of OpenMDSP is to fill the gap between the OpenMP memory model and the memory hierarchy of multi-core DSPs. We propose three classes of directives in OpenMDSP, including 1) data placement directives that allow programmers to control the placement of global variables conveniently, 2) distributed array directives that divide a whole array into sections and promote the sections into core-local memory to improve performance, and 3) stream access directives that promote big arrays into core-local memory section by section during parallel loop processing while hiding the latency of data movement by the direct memory access (DMA) of a DSP. We implement the compiler and runtime system for OpenMDSP on PreeScale MSC8156. The benchmarking results show that seven of nine benchmarks achieve a speedup of more than a factor of 5 when using six threads. 展开更多
关键词 OPENMP multi-core digital signal processor data parallelism Long Term Evolution
原文传递
YHFT-QDSP:High-Performance Heterogeneous Multi-Core DSP
14
作者 陈书明 万江华 +8 位作者 鲁建壮 刘仲 孙海燕 孙永节 刘衡竹 刘祥远 李振涛 徐毅 陈小文 《Journal of Computer Science & Technology》 SCIE EI CSCD 2010年第2期214-224,共11页
Multi-core architectures are widely used to in time-to-market and power consumption of the chips enhance the microprocessor performance within a limited increase Toward the application of high-density data signal pro... Multi-core architectures are widely used to in time-to-market and power consumption of the chips enhance the microprocessor performance within a limited increase Toward the application of high-density data signal processing, this paper presents a novel heterogeneous multi-core architecture digital signal processor (DSP), YHFT-QDSP, with one RISC CPU core and 4 VLIW DSP cores. By three kinds of interconnection, YHFT-QDSP provides high efficiency message communication for inner-chip RISC core and DSP cores, inner-chip and inter-chip DSP cores. A parallel programming platform is specifically developed for the heterogeneous nmlti-core architecture of YHFT-QDSP. This parallel programming environment provides a parallel support library and a friendly interface between high level application softwares and multi- core DSP. The 130 nm CMOS custom chip design results benchmarks show that the interconnection structure of in a high speed and moderate power design. The results of typical YHFT-QDSP is much better than other related structures and achieves better speedup when using the interconnection facilities in combing methods. YHFT-QDSP has been signed off and manufactured presently. The future applications of the multi-core chip could be found in 3G wireless base station, high performance radar, industrial applications, and so on. 展开更多
关键词 digital signal processor (DSP) multi-core ARCHITECTURE parallel programming custom design
原文传递
Performance modeling of positive degraded task-pair with helper-thread in CMP
15
作者 Gu Zhimin Zheng Ninghan +3 位作者 Zhang Yi Liu Changding Tang Jie Huang Yan 《High Technology Letters》 EI CAS 2010年第3期221-226,共6页
Helper-thread of a task can hide the memory access time of irregular data on the chip muhi-core processor (CMP). For constructing a compiler that effectively supports the helper-thread of a task in the multi-core sc... Helper-thread of a task can hide the memory access time of irregular data on the chip muhi-core processor (CMP). For constructing a compiler that effectively supports the helper-thread of a task in the multi-core scenario based on the last level shared cache, this paper studies its performance stable condi- tions. Unfortunately, there is no existing model that allows extensive investigation of the impact of stable conditions, we present the base of pre-computation that is formalized by our degraded task-pair 〈 T, T' 〉 with the helper-thread, and its stable conditions are analyzed. Finally, a novel performance model and a constructing method of pre-computation based on our positive degraded task-pair are proposed. The efficient results are shown by our experiments. If we further exploit memory level parallelism (MLP) for our task-pair, the task-pair 〈 T, T' 〉 can reach better performance. 展开更多
关键词 chip multi-core processor (CMP) helper-thread pre-computation performance model
下载PDF
基于Tilera平台的AVS视频编码器的研究与实现
16
作者 许昌满 陈兆龙 《电子技术(上海)》 2015年第4期74-77,共4页
Tile处理器是Tilera公司研发的一种新型的多核处理器,文章在介绍Tilera平台的多核处理器的基础上,根据该处理器的架构特点,在该平台上实现了AVS和H.264的标清/高清实时视频编码器。
关键词 AVS tilera多核处理器 视频编码 H.264
原文传递
上一页 1 下一页 到第
使用帮助 返回顶部