24 articles found
Shared Cache Based on Content Addressable Memory in a Multi-Core Architecture
1
Authors: Allam Abumwais, Mahmoud Obaid 《Computers, Materials & Continua》 SCIE EI 2023, No. 3, pp. 4951-4963 (13 pages)
Modern shared-memory multi-core processors typically have shared Level 2 (L2) or Level 3 (L3) caches. Cache bottlenecks and replacement strategies are the main problems of such architectures, where multiple cores try to access the shared cache simultaneously. The main problem in improving memory performance is the shared cache architecture and cache replacement. This paper documents the implementation of a Dual-Port Content Addressable Memory (DPCAM) and a modified Near-Far Access Replacement Algorithm (NFRA), which was previously proposed as a shared L2 cache layer in a multi-core processor. Standard Performance Evaluation Corporation (SPEC) Central Processing Unit (CPU) 2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer. Results show improved performance of the multi-core processor's DPCAM and NFRA algorithms, corresponding to a higher number of concurrent accesses to shared memory. The new architecture significantly increases system throughput and records performance improvements of up to 8.7% on various types of SPEC 2006 benchmarks. The miss rate is also improved by about 13%, with some exceptions in the sphinx3 and bzip2 benchmarks. These results could open a new window for solving the long-standing problems with shared cache in multi-core processors.
Keywords: multi-core processor; shared cache; content addressable memory; dual port CAM; replacement algorithm; benchmark program
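The idea of a content addressable cache, as named in the abstract above, can be sketched in a few lines: entries are found by matching a tag against stored content rather than by computing an index. This is only an illustrative software model. The class name `SimpleCAM`, its methods, and the FIFO eviction are assumptions made for the sketch; the abstract does not describe the DPCAM design or the NFRA policy in enough detail to reproduce them, and FIFO stands in purely as a placeholder.

```python
from collections import OrderedDict


class SimpleCAM:
    """Toy software model of a content addressable memory (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # tag -> data, kept in insertion order
        self.hits = 0
        self.misses = 0

    def search(self, tag):
        """Look an entry up by its tag (content match), counting hits and misses."""
        if tag in self.entries:
            self.hits += 1
            return self.entries[tag]
        self.misses += 1
        return None

    def write(self, tag, data):
        """Store an entry; evict the oldest one when full.

        FIFO eviction is a placeholder assumption, not the NFRA policy.
        """
        if tag not in self.entries and len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[tag] = data

    def miss_rate(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


if __name__ == "__main__":
    cam = SimpleCAM(capacity=2)
    cam.write(0x10, "line A")
    cam.write(0x20, "line B")
    print(cam.search(0x10), cam.search(0x40), cam.miss_rate())
```

A real dual-port CAM serves two such searches in the same cycle in hardware; the model above only shows the lookup-by-content behaviour and how a miss rate would be counted.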
Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on Multi-Core Processor
2
Authors: Zhang Ziran, Li Jun, Li Changxiao (ZTE Corporation, Shenzhen 518057, P.R. China) 《ZTE Communications》 2009, No. 1, pp. 54-58 (5 pages)
The Long Term Evolution (LTE) system imposes strict requirements on dispatching delay. Moreover, the very high air interface rate of LTE demands good processing capability from the devices that process the baseband signals. Consequently, a single-core processor cannot meet the requirements of an LTE system. This paper analyzes how to use multi-core processors to achieve parallel processing of uplink demodulation and decoding in LTE systems and designs an approach to parallel processing. The test results indicate that this approach works well.
Keywords: CORE LTE Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on multi-core processor Design
High-security multi-constellation shaping modulation with asymmetric encryption
3
Authors: 姜蕾, 刘博, 任建新, 吴翔宇, Rahat Ullah, 毛雅亚, 陈帅东, 马一澜, 赵立龙, 田凤 《Chinese Optics Letters》 SCIE EI CAS CSCD 2024, No. 4, pp. 14-19 (6 pages)
This Letter proposes a high-security modulation scheme for optical transmission systems. By using multi-constellation shaping and asymmetric encryption, information security can be enhanced and cracking by quantum computers can be effectively resisted. Three-dimensional (3D) carrier-less amplitude phase modulation is used to superpose and transmit 3D signals. Experimental verification is conducted using a seven-core weakly coupled fiber platform. The results demonstrate that the proposed scheme can effectively protect the system from any illegal attacker.
Keywords: multi-core transmission; asymmetric encryption; multi-constellation shaping
System Architecture of Godson-3 Multi-Core Processors Cited by: 7
4
Authors: 高翔, 陈云霁, 王焕东, 唐丹, 胡伟武 《Journal of Computer Science & Technology》 SCIE EI CSCD 2010, No. 2, pp. 181-191 (11 pages)
Godson-3 is the latest generation of the Godson microprocessor family. It adopts a scalable multi-core architecture with hardware support for accelerating applications, including X86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from various aspects, including system scalability, organization of the memory hierarchy, network-on-chip, inter-chip connection and I/O subsystem.
Keywords: multi-core processor; scalable interconnection; cache coherent non-uniform memory access/non-uniform cache access (CC-NUMA/NUCA); MESH; CROSSBAR; cache coherence; reliability, availability and serviceability (RAS)
Parallel computing of discrete element method on multi-core processors Cited by: 6
5
Authors: Yusuke Shigeto, Mikio Sakai 《Particuology》 SCIE EI CAS CSCD 2011, No. 4, pp. 398-405 (8 pages)
This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention in accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, the DEM algorithm is shown to have high scalability in a multi-thread parallel computation on a CPU.
Keywords: Discrete element method; Parallel computing; multi-core processor; GPGPU
Energy Efficiency of a Multi-Core Processor by Tag Reduction
6
Authors: 郑龙, 董冕雄, Kaoru Ota, 金海, Song Guo, 马俊 《Journal of Computer Science & Technology》 SCIE EI CSCD 2011, No. 3, pp. 491-503 (13 pages)
We consider the energy saving problem for caches on a multi-core processor. Previous research on low-power processors offers various methods to reduce power dissipation, and tag reduction is one of them. This paper extends the tag reduction technique from a single-core processor to a multi-core processor and investigates the potential for energy saving on multi-core processors. We formulate our approach as an equivalent problem: finding an assignment of all instruction pages in physical memory to a set of cores such that the tag-reduction conflicts for each core are mostly avoided or reduced. We then propose three algorithms using different heuristics for this assignment problem. We provide experimental results collected from a real operating system, instead of the traditional approach of using a processor simulator that cannot simulate operating system functions and the full memory hierarchy. Experimental results show that, on average, our proposed algorithms save up to 83.93% of total energy on an 8-core processor and 76.16% on a 4-core processor compared with the case in which tag reduction is not used. They also significantly outperform the tag-reduction-based algorithm on a single-core processor.
Keywords: tag reduction; multi-core processor; energy efficiency
Schedule refinement for homogeneous multi-core processors in the presence of manufacturing-caused heterogeneity
7
Authors: Zhi-xiang CHEN, Zhao-lin LI, Shan CAO, Fang WANG, Jie ZHOU 《Frontiers of Information Technology & Electronic Engineering》 SCIE EI CSCD 2015, No. 12, pp. 1018-1033 (16 pages)
Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous down-scaling of CMOS technology, within-die variations in the manufacturing process lead to a significant spread in the operating speeds of cores within homogeneous multi-core processors. Task scheduling approaches that do not consider such heterogeneity caused by within-die variations can lead to an overly pessimistic result in terms of performance. To realize optimal performance according to the actual maximum clock frequencies at which cores can run, we present a heterogeneity-aware schedule refining (HASR) scheme that fully exploits the heterogeneities of homogeneous multi-core processors in embedded domains. We analyze and show how the actual maximum frequencies of cores are used to guide the scheduling. In the scheme, representative chip operating points are selected and the corresponding optimal schedules are generated as candidate schedules. During the booting of each chip, according to the actual maximum clock frequencies of the cores, one of the candidate schedules is bound to the chip to maximize the performance. A set of applications is designed to evaluate the proposed scheme. Experimental results show that the proposed scheme improves performance by an average of 22.2% compared with the baseline schedule based on worst-case timing analysis. Compared with the conventional task scheduling approach based on the actual maximum clock frequencies, the proposed scheme also improves performance by up to 12%.
Keywords: Schedule refining; multi-core processor; HETEROGENEITY; Representative chip operating point
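The boot-time binding step described in the abstract above can be sketched as a simple selection over precomputed candidates. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `CandidateSchedule` and `bind_schedule`, the frequencies, and the makespan values are all hypothetical, and the selection rule (pick the fastest candidate whose operating point does not exceed any core's measured maximum frequency) is one plausible reading of the abstract.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CandidateSchedule:
    name: str
    operating_point: Tuple[float, ...]  # assumed per-core clock frequency in MHz
    makespan: float                     # assumed schedule length in ms


def bind_schedule(candidates: Sequence[CandidateSchedule],
                  actual_freqs: Sequence[float]) -> CandidateSchedule:
    """Pick the shortest candidate schedule that this chip can run safely."""
    feasible = [
        c for c in candidates
        if len(c.operating_point) == len(actual_freqs)
        and all(actual >= needed
                for actual, needed in zip(actual_freqs, c.operating_point))
    ]
    if not feasible:
        raise ValueError("no candidate schedule is safe for this chip")
    return min(feasible, key=lambda c: c.makespan)


# Hypothetical representative operating points for a two-core chip.
CANDIDATES = [
    CandidateSchedule("worst_case", (400.0, 400.0), 10.0),
    CandidateSchedule("core0_fast", (500.0, 400.0), 8.5),
    CandidateSchedule("both_fast", (500.0, 500.0), 7.8),
]

if __name__ == "__main__":
    # Frequencies that would be measured during boot on one particular chip.
    print(bind_schedule(CANDIDATES, (520.0, 450.0)).name)
```

Keeping the worst-case point among the candidates gives every chip that meets the specification at least one safe schedule, which is why the baseline in the abstract is a lower bound rather than a separate code path.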
Thread Private Variable Access Optimization Technique for Sunway High-Performance Multi-core Processors
8
Authors: Jinying Kong, Kai Nie, Qinglei Zhou, Jinlong Xu, Lin Han 《国际计算机前沿大会会议论文集》 2021, No. 1, pp. 180-189 (10 pages)
The primary way to achieve thread-level parallelism on the Sunway high-performance multi-core processor is to use the OpenMP programming technique. To address the problem of low parallel efficiency caused by slow access to thread-private variables in the compilation of Sunway OpenMP programs, this paper proposes a thread-private variable access technique based on privileged instructions. The technique centralizes the implementation of thread-private variables at the compiler level, eliminating the mode-switching overhead of invoking OS kernel processing and improving the speed of accessing thread-private variables. On the Sunway 1621 server platform, NPB3.3-OMP and SPEC OMP2012 achieved running efficiency gains of 6.2% and 6.8%, respectively. The results show that the techniques proposed in this paper can provide technical support for giving full play to the advantages of Sunway's high-performance multi-core processors.
Keywords: Sunway high-performance multi-core processors; OpenMP programming technique; Privileged instruction-based thread-private variable access technique; Sunway 1621 processor
Parallel Region Reconstruction Technique for Sunway High-Performance Multi-core Processors
9
Authors: Kai Nie, Qinglei Zhou, Hong Qian, Jianmin Pang, Jinlong Xu, Yapeng Li 《国际计算机前沿大会会议论文集》 2021, No. 1, pp. 163-179 (17 pages)
The leading way to achieve thread-level parallelism on the Sunway high-performance multi-core processors is to use OpenMP programming techniques. To address the problem of low parallel efficiency caused by high thread group control overhead in the compilation of Sunway OpenMP programs, this paper proposes the parallel region reconstruction technique. The technique expands the scope of parallel regions in OpenMP programs by parallel region merging and parallel region extending. Moreover, it reduces the number of parallel regions in OpenMP programs, decreases the overhead of frequent creation and convergence of thread groups, and converts standard fork-join model OpenMP programs to higher-performance SPMD model OpenMP programs. On the Sunway 1621 server computer, NPB3.3-OMP and SPEC OMP2012 achieved running efficiency improvements of 8.9% and 7.9%, respectively, through the parallel region reconstruction technique. As a result, the parallel region reconstruction technique is feasible and effective. It provides technical support to fully exploit the multi-core parallelism advantage of Sunway's high-performance processors.
Keywords: Sunway high-performance multi-core processors; OpenMP programming technique; Parallel domain reconstruction technique
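Design and Application of an On-Chip Heterogeneous Dual-PowerPC Radar Controller Cited by: 1
10
Authors: 施海锋, 柏玉娴 《现代雷达》 CSCD PKU Core 2014, No. 6, pp. 35-38, 44 (5 pages)
Exploiting the two PowerPC440 embedded processor cores available in Virtex-5 FXT series FPGAs, this paper proposes a design method for an embedded radar controller built on a "master-slave" heterogeneous control model architecture. The method uses high-speed serial transmission technologies such as FC and sRIO to increase the controller's interface bandwidth and, through advance task planning, makes full use of both PowerPC processors; the design cost is significantly lower than that of existing solutions. Application shows that the controller's overall performance is markedly improved and that it can meet the microsecond-level response and gigabit-level transmission requirements of modern phased-array radar.
Keywords: radar controller; multi-core processor; heterogeneous architecture; asymmetric multiprocessing; Fibre Channel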
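An Improved Processor Load-Balancing Method under the AMP Architecture Cited by: 1
11
Authors: 蒋建军, 刘彤 《山东农业大学学报(自然科学版)》 CSCD 2015, No. 1, pp. 96-100 (5 pages)
In a multi-core heterogeneous configuration, data flows vary greatly in size and the detection processors take a long time to inspect them, which leaves the load across the detection processors unbalanced. In addition, because tasks are assigned statically between the network processor and the detection processors, complete balance cannot be achieved. To address these shortcomings, a circular work queue method is proposed that dynamically senses the load-balance state of the processors. It improves the balancing method for the detection processors, makes fuller use of their performance, resolves the inability to balance load between the network processor and the detection processors, and improves the overall performance of the system.
Keywords: IPS; AMP architecture; heterogeneous mode; detection processor; load balancing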
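A Task Scheduling and Allocation Algorithm for Asymmetric Multi-Core SDR
12
Authors: 徐力, 史少波 《计算机工程》 CAS CSCD 2014, No. 1, pp. 83-87, 97 (6 pages)
Targeting the synchronous data flow characteristics of software-defined radio (SDR) applications, a task scheduling and allocation algorithm for asymmetric multi-core SDR is proposed. The algorithm jointly considers inter-task communication time and fixed task pipelining, so that scheduling and allocation remain both general and parallel. Task scheduling and allocation are modeled with integer linear programming (ILP), and a task-splitting method is used to optimize the result, further improving execution efficiency. The scheduling and allocation of IEEE 802.11a frequency offset estimation are implemented on the target SDR platform. Experimental results show that the algorithm increases SDR platform throughput by 5.97% and average processor core utilization by 3.03%, and reduces the longest idle waiting time of the processor cores by 34.31%.
Keywords: task scheduling and allocation; software-defined radio; asymmetric multi-core processor; integer linear programming; digital signal processing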
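An Integrated Virtual Machine Scheduling Algorithm for Asymmetric Multi-Core Processors Cited by: 2
13
Authors: 陈锐忠, 齐德昱, 林伟伟, 李剑 《计算机学报》 EI CSCD PKU Core 2014, No. 7, pp. 1466-1477 (12 pages)
In the field of computer architecture, asymmetric multi-core processors will become the mainstream in the future. For the problem of scheduling virtual processors on asymmetric multi-core processors, existing research lacks theoretical analysis and does not consider the synchronization characteristics of virtual processors. To address this problem, the paper first builds a nonlinear programming model and derives scheduling principles that jointly account for virtual processor synchronization characteristics, core asymmetry, and core load. Based on these principles, it then proposes an integrated scheduling algorithm. The algorithm defines the concepts of utility factor, proportional coefficient, and proportional resource, and measures resources and load comprehensively by combining the synchronization characteristics of virtual processors with the asymmetry of the cores; it also reduces scheduling overhead through run queue decomposition. The proposed algorithm is the first scheduling algorithm to exploit virtual processor synchronization characteristics on asymmetric multi-core processors. Experiments on a real platform show that the algorithm achieves fair scheduling and improves performance by 19% to 48% over other comparable algorithms.
Keywords: cloud computing; virtualization; asymmetric multi-core processor; virtual processor scheduling; load balancing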
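Adaptive Scheduling on Performance-Asymmetric Multi-Core Processors Cited by: 1
14
Authors: 聂鹏程, 段振华, 田聪, 杨孟飞 《计算机学报》 EI CSCD PKU Core 2013, No. 4, pp. 773-781 (9 pages)
Existing scheduling algorithms for performance-asymmetric multi-core processors either fail to exploit the architecture fully and deliver low throughput, or exploit it fully but scale poorly. Even the algorithms that consider scalability limit it to the number of CPU cores and ignore scalability in the number of tasks. To solve these problems, the authors propose an adaptive scheduling algorithm called AS4AMS. Each time a task is scheduled, AS4AMS first derives the task's computing demand by analyzing its average stall time at run time, and then assigns the task to a suitable CPU core according to that demand and the load on each CPU core. The algorithm repeats this process until the task finishes, in order to adapt to the task's changing demand. Experimental results show that, compared with existing methods, the proposed method scales better and achieves higher throughput.
Keywords: multi-core processor; performance asymmetry; operating system; scheduling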
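The per-decision step of a demand-aware scheduler like the one in the abstract above can be sketched as follows. This is a hedged illustration, not AS4AMS itself: the cost model (compute time scales with core speed, stall time does not, and both are stretched by the number of tasks already on the core), the function names, and the core parameters are all assumptions made for the sketch.

```python
def estimated_time(compute, stall, core_speed, core_load):
    """Assumed cost model: only the compute part benefits from a faster core.

    compute    -- time the task spends computing on a speed-1.0 core
    stall      -- time the task spends stalled (for example on memory)
    core_speed -- relative speed of the core (1.0 = slowest core)
    core_load  -- number of tasks already queued on the core
    """
    return (compute / core_speed + stall) * (core_load + 1)


def pick_core(compute, stall, cores):
    """Return the name of the core with the lowest estimated time.

    cores maps a core name to a (speed, load) tuple.
    """
    return min(cores, key=lambda name: estimated_time(compute, stall, *cores[name]))


if __name__ == "__main__":
    # Hypothetical machine: one busy fast core and one idle slow core.
    cores = {"big": (3.0, 1), "little": (1.0, 0)}
    print(pick_core(compute=9.0, stall=1.0, cores=cores))  # compute-heavy task
    print(pick_core(compute=1.0, stall=9.0, cores=cores))  # stall-heavy task
```

Re-running `pick_core` with fresh measurements at every scheduling point mirrors the repeated re-evaluation the abstract describes, since a task's stall behaviour can change between phases.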
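A Comprehensive Scheduling Algorithm for Asymmetric Multi-Core Processors Cited by: 2
15
Authors: 陈锐忠, 齐德昱, 林伟伟, 李剑 《软件学报》 EI CSCD PKU Core 2013, No. 2, pp. 343-357 (15 pages)
When scheduling tasks on asymmetric multi-core processors, existing operating system schedulers do not take the asymmetry into account. For the operating system scheduling problem on single-ISA asymmetric multi-core processors, the paper first builds a linear programming model, analyzes the relevant factors, and derives three scheduling principles: behavior matching, migration reduction, and load balancing. Based on these principles, it then proposes a comprehensive scheduling algorithm with two parts: 1) integrated workload characterization, which introduces the concept of integrated behavior to measure both the overall and the phase behavior of a task; and 2) a scheduling algorithm based on integrated behavior, which exploits the characteristics of asymmetric multi-core processors effectively, keeps the load balanced across cores, and avoids unnecessary task migration. The algorithm also achieves generality through a parameter adjustment mechanism. It is a general-purpose scheduling algorithm that handles both the overall and the phase behavior of tasks. Experimental results on a real platform show that the algorithm is applicable to a variety of environments and improves performance by 6% to 22% over the corresponding algorithms.
Keywords: asymmetric multi-core processor; operating system scheduling; workload characterization; load balancing; task migration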
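Integrated Operating System Scheduling on Asymmetric Multi-Core Processors Cited by: 2
16
Authors: 陈锐忠, 齐德昱, 林伟伟, 李剑 《计算机学报》 EI CSCD PKU Core 2012, No. 3, pp. 616-626 (11 pages)
Compared with symmetric multi-core processors, asymmetric multi-core processors offer higher energy efficiency and will become the mainstream architecture for future parallel operating systems. For the problem of scheduling parallel tasks in operating systems on asymmetric multi-core processors, existing research assumes that all core frequencies are constant, lacks theoretical analysis, and does not consider the energy efficiency or generality of the algorithm. To address this problem, the paper first builds a nonlinear programming model and derives scheduling principles that jointly account for the synchronization characteristics of parallel tasks, core asymmetry, and core load. Based on these principles, it then proposes an integrated scheduling algorithm that improves energy efficiency by integrating thread scheduling with dynamic voltage and frequency scaling, and achieves generality through a parameter adjustment mechanism. The proposed algorithm is the first scheduling algorithm to combine thread scheduling and dynamic voltage and frequency scaling on asymmetric multi-core processors. Experiments on a real platform show that the algorithm is applicable to a variety of environments and achieves 24% to 50% higher energy efficiency than other comparable algorithms.
Keywords: green computing; asymmetric multi-core processor; operating system scheduling; parallel task scheduling; dynamic voltage and frequency scaling; load balancing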
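A Study of an Amdahl's Law Performance Model for Asymmetric Multi-Core Architectures Cited by: 2
17
Authors: 冯叶, 邓倩妮 《微电子学与计算机》 CSCD PKU Core 2011, No. 8, pp. 32-34, 38 (4 pages)
Amdahl's law describes the performance of parallel computing and identifies where the performance limit of a task lies. Asymmetric multi-core hardware architectures, proposed to break through this bottleneck, need a suitable performance model to support their theoretical value. The new quantitative model introduces parameters such as the specific core configuration. By analyzing and comparing the speedup under the two architectures, it concludes that task performance on an asymmetric architecture is better than on a symmetric one. The model also shows that converting a symmetric architecture into an asymmetric one according to certain design rules can optimize the whole system to a large extent.
Keywords: Amdahl's law; asymmetric multi-core architecture; parallel computing model; system architecture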
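The kind of comparison the abstract above describes can be illustrated with the widely cited Hill and Marty extension of Amdahl's law to multi-core chips. This is an assumption made for illustration: the abstract does not give the paper's own model or parameters. In the sketch, `f` is the parallel fraction of the work, `n` is the chip budget in base-core equivalents, `r` is the budget spent on one large core, and `perf(r) = sqrt(r)` is an assumed performance function for a core built from `r` base-core equivalents.

```python
import math


def perf(r):
    """Assumed sequential performance of a core built from r base-core equivalents."""
    return math.sqrt(r)


def speedup_symmetric(f, n, r):
    """Symmetric chip: n / r identical cores, each built from r base-core equivalents."""
    serial = (1 - f) / perf(r)
    parallel = f * r / (perf(r) * n)
    return 1.0 / (serial + parallel)


def speedup_asymmetric(f, n, r):
    """Asymmetric chip: one large core of size r plus n - r base cores."""
    serial = (1 - f) / perf(r)
    parallel = f / (perf(r) + n - r)
    return 1.0 / (serial + parallel)


if __name__ == "__main__":
    f, n = 0.9, 64  # hypothetical workload and chip budget
    for r in (1, 4, 16):
        print(r,
              round(speedup_symmetric(f, n, r), 2),
              round(speedup_asymmetric(f, n, r), 2))
```

With `r = 1` the two formulas coincide, because the "large" core is just another base core; the gap opens as more of the budget is concentrated in one core that runs the serial part quickly while the remaining small cores still contribute to the parallel part.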
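Design of Parallel Software for Embedded Asymmetric Multi-Core Processors Cited by: 1
18
Authors: 李志远, 赵元富, 兰利东 《微电子学与计算机》 CSCD PKU Core 2013, No. 8, pp. 107-111 (5 pages)
Multi-core processor architectures have extended from general-purpose computing into embedded computing and have become the mainstream form of embedded processor development. To address the inability of traditional serial software to make effective use of the computing resources of embedded asymmetric multi-core processors, and to improve the performance of embedded multi-core processors, the paper studies asymmetric multi-core structures and the corresponding forms of parallelism. For the special structure of embedded asymmetric multi-core processors, it proposes component-based hybrid parallel software and builds a parallel execution environment for asymmetric multi-core processors, which makes full use of the system's computing resources and improves its computing performance.
Keywords: asymmetric multi-core processor; parallel software; embedded system; component-based software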
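A Survey of Research on Asymmetric Embedded Systems Based on Multi-Core Processors Cited by: 4
19
Authors: 瞿伟, 余飞鸿 《计算机科学》 CSCD PKU Core 2021, No. S01, pp. 538-542 (5 pages)
As embedded systems develop and diversify, many fields such as industrial control, robotics, and video and imaging systems place ever higher demands on them: good functional extensibility and maintainability are required, and the properties of dedicated tasks (such as real-time behavior) must also be guaranteed. Asymmetric embedded systems based on multi-core processors are an important direction for solving these problems. Depending on whether all cores have the same status, multi-core processors can be divided into homogeneous and heterogeneous structures. An asymmetric embedded system can be built on either a homogeneous or a heterogeneous multi-core processor by partitioning the cores at the hardware or software level and running different tasks on each partition, so that the embedded system offers both good functional extensibility and real-time performance. This paper summarizes and compares the state of research on asymmetric embedded systems based on multi-core processors, reviews their applications in research and engineering, and finally examines possible future directions for such systems.
Keywords: asymmetric multiprocessing; multi-core processor; dual operating system; embedded system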
OpenMDSP: Extending OpenMP to Program Multi-Core DSPs Cited by: 1
20
Authors: 何江舟, 陈文光, 陈光日, 郑纬民, 汤志忠, 叶寒栋 《Journal of Computer Science & Technology》 SCIE EI CSCD 2014, No. 2, pp. 316-331 (16 pages)
Multi-core digital signal processors (DSPs) are widely used in wireless telecommunication, core network transcoding, industrial control, and audio/video processing technologies, among others. In comparison with general-purpose multi-processors, multi-core DSPs normally have a more complex memory hierarchy, such as on-chip core-local memory and non-cache-coherent shared memory. As a result, efficient multi-core DSP applications are very difficult to write. The current approach used to program multi-core DSPs is based on proprietary vendor software development kits (SDKs), which only provide low-level, non-portable primitives. While it is acceptable to write coarse-grained task-level parallel code with these SDKs, writing fine-grained data parallel code with SDKs is a very tedious and error-prone approach. We believe that it is desirable to possess a high-level and portable parallel programming model for multi-core DSPs. In this paper, we propose OpenMDSP, an extension of OpenMP designed for multi-core DSPs. The goal of OpenMDSP is to fill the gap between the OpenMP memory model and the memory hierarchy of multi-core DSPs. We propose three classes of directives in OpenMDSP, including 1) data placement directives that allow programmers to control the placement of global variables conveniently, 2) distributed array directives that divide a whole array into sections and promote the sections into core-local memory to improve performance, and 3) stream access directives that promote big arrays into core-local memory section by section during parallel loop processing while hiding the latency of data movement by the direct memory access (DMA) of a DSP. We implement the compiler and runtime system for OpenMDSP on Freescale MSC8156. The benchmarking results show that seven of nine benchmarks achieve a speedup of more than a factor of 5 when using six threads.
Keywords: OPENMP; multi-core digital signal processor; data parallelism; Long Term Evolution