59 articles found
Multi-core optimization for conjugate gradient benchmark on heterogeneous processors
1
Authors: Deng Lin (邓林), Dou Yong (窦勇). 《Journal of Central South University》, SCIE EI CAS, 2011, No. 2, pp. 490-498 (9 pages)
Developing parallel applications on heterogeneous processors faces the "memory wall" challenge, caused by the limited capacity of local storage, limited bandwidth, and long memory access latency. To address this problem, a parallelization approach with six memory optimization schemes is proposed for CG, four of which target all kinds of sparse matrix-vector multiplication (SPMV) operations. On the IBM QS20, the approach reaches speedups of up to 21 and 133 times for sizes A and B, respectively, compared with a single Power Processor Element. The conclusions are that the peak memory access bandwidth of the Cell BE can be attained in SPMV, that simple computation is more efficient on heterogeneous processors, and that loop unrolling can hide local storage access latency when executing scalar operations on SIMD cores.
Keywords: multi-core processor; NAS; parallelization; CG; memory optimization
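For readers unfamiliar with the kernel this abstract refers to, the following is a minimal sketch of sparse matrix-vector multiplication inside a conjugate gradient solver. It assumes a CSR (compressed sparse row) layout and plain Python lists; the function names `spmv_csr`, `dot`, and `conjugate_gradient` are illustrative and do not come from the paper. None of the paper's Cell BE memory optimizations (local storage management, loop unrolling, SIMD execution) are reproduced here.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form (assumed layout)."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Non-zeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def conjugate_gradient(values, col_idx, row_ptr, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive definite A given in CSR form."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x for the initial guess x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = spmv_csr(values, col_idx, row_ptr, p)   # the SpMV hot spot
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

Each iteration performs exactly one SpMV, whose indirect reads of `x[col_idx[k]]` are the irregular memory accesses that make the kernel bandwidth-bound.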
Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on Multi-Core Processor
2
Authors: Zhang Ziran, Li Jun, Li Changxiao (ZTE Corporation, Shenzhen 518057, P.R. China). 《ZTE Communications》, 2009, No. 1, pp. 54-58 (5 pages)
The Long Term Evolution (LTE) system imposes strict requirements on dispatching delay. Moreover, the very high air interface rate of LTE demands strong processing capability from the devices that process baseband signals. Consequently, a single-core processor cannot meet the requirements of an LTE system. This paper analyzes how to use multi-core processors to process uplink demodulation and decoding in parallel in LTE systems and designs a parallel processing approach. Test results show that the approach works well.
Keywords: CORE; LTE; Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on multi-core processor; Design
Shared Cache Based on Content Addressable Memory in a Multi-Core Architecture
3
Authors: Allam Abumwais, Mahmoud Obaid. 《Computers, Materials & Continua》, SCIE EI, 2023, No. 3, pp. 4951-4963 (13 pages)
Modern shared-memory multi-core processors typically have shared Level 2 (L2) or Level 3 (L3) caches. Cache bottlenecks and replacement strategies are the main problems of such architectures, where multiple cores try to access the shared cache simultaneously. The main problem in improving memory performance is the shared cache architecture and cache replacement. This paper documents the implementation of a Dual-Port Content Addressable Memory (DPCAM) and a modified Near-Far Access Replacement Algorithm (NFRA), which was previously proposed as a shared L2 cache layer in a multi-core processor. Standard Performance Evaluation Corporation (SPEC) Central Processing Unit (CPU) 2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer. Results show improved performance of the multi-core processor's DPCAM and NFRA algorithms, corresponding to a higher number of concurrent accesses to shared memory. The new architecture significantly increases system throughput and records performance improvements of up to 8.7% on various types of SPEC 2006 benchmarks. The miss rate is also improved by about 13%, with some exceptions in the sphinx3 and bzip2 benchmarks. These results could open a new window for solving the long-standing problems with shared cache in multi-core processors.
Keywords: multi-core processor; shared cache; content addressable memory; dual port CAM; replacement algorithm; benchmark program
System Architecture of Godson-3 Multi-Core Processors (cited by: 7)
4
Authors: Gao Xiang (高翔), Chen Yunji (陈云霁), Wang Huandong (王焕东), Tang Dan (唐丹), Hu Weiwu (胡伟武). 《Journal of Computer Science & Technology》, SCIE EI CSCD, 2010, No. 2, pp. 181-191 (11 pages)
Godson-3 is the latest generation of the Godson microprocessor family. It adopts a scalable multi-core architecture with hardware support for accelerating applications, including x86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from several aspects, including system scalability, organization of the memory hierarchy, network-on-chip, inter-chip connection, and the I/O subsystem.
Keywords: multi-core processor; scalable interconnection; cache coherent non-uniform memory access/non-uniform cache access (CC-NUMA/NUCA); MESH; CROSSBAR; cache coherence; reliability, availability and serviceability (RAS)
Parallel computing of discrete element method on multi-core processors (cited by: 6)
5
Authors: Yusuke Shigeto, Mikio Sakai. 《Particuology》, SCIE EI CAS CSCD, 2011, No. 4, pp. 398-405 (8 pages)
This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention for accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, the DEM algorithm is shown to have high scalability in multi-thread parallel computation on a CPU.
Keywords: discrete element method; parallel computing; multi-core processor; GPGPU
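Design and Analysis of Performance Tests for Domestic Processors Based on SPEC CPU 2006
6
Authors: Liu Jian (刘建), Li Xiaojing (李晓静), Liu Yang (刘阳), Zhang Mingjuan (张明娟), Wu Chen (吴宸). 《电子质量》 (Electronics Quality), 2024, No. 4, pp. 105-110 (6 pages)
This paper surveys domestic (Chinese-made) processors of different architectures and describes their current state of development. Based on how a processor operates, it analyzes the internal and external factors that affect processor performance. Test scenarios are designed with different memory capacities, different memory speeds, and different versions of the GCC compiler, and the internationally recognized CPU performance benchmark SPEC CPU2006 is used to measure the speed and throughput performance of domestic processors based on the ARM and x86 architectures. The benchmark scores are compared to analyze how each configuration affects the results. The results show that larger memory capacity and higher memory speed have little effect on speed performance but yield better throughput performance, and that the higher the GCC version, the higher the throughput score.
Keywords: domestic processor; SPEC CPU2006; performance testing; memory capacity; speed; computing speed; throughput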
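Design and Implementation of a High-Speed Network Data Processing Framework for the Sunway (申威) Platform
7
Authors: Cao Jianjun (曹建军), She Ping (佘平), Nie Shiqiang (聂世强). 《计算机技术与发展》 (Computer Technology and Development), 2024, No. 7, pp. 184-191 (8 pages)
With the surge of network traffic in the big data era, the traditional kernel network protocol stack, in which kernel switching accounts for a large share of the overhead, prevents existing kernel-based network data processing systems from fully exploiting the send and receive capacity of high-speed 10 Gb and even 100 Gb network cards. To reduce kernel switching overhead, the open-source DPDK user-space networking development kit was proposed to support high-speed traffic processing, and it has been widely applied and deployed on x86 platforms. To meet the requirements of domestic IT innovation (信创) and network security, a DPDK-based framework for packet reassembly and parsing of network traffic is designed and implemented for the domestic Sunway processor platform. It makes full use of DPDK mechanisms such as huge-page memory and lock-free queues, uses a multi-threaded parallel design to exploit the multi-core performance of Sunway processors, supports parsing of many common application-layer protocols over TCP/UDP, and is lightweight and extensible. Experiments on a real hardware platform show that the framework outperforms existing mainstream software by about 10%, a preliminary exploration of high-speed network data processing on domestic processor platforms.
Keywords: DPDK; protocol analysis; high-speed network; TCP/IP protocol stack; domestic processor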
Energy Efficiency of a Multi-Core Processor by Tag Reduction
8
Authors: Zheng Long (郑龙), Dong Mianxiong (董冕雄), Kaoru Ota, Jin Hai (金海), Song Guo, Ma Jun (马俊). 《Journal of Computer Science & Technology》, SCIE EI CSCD, 2011, No. 3, pp. 491-503 (13 pages)
We consider the energy saving problem for caches on a multi-core processor. Previous research on low-power processors offers various methods to reduce power dissipation, and tag reduction is one of them. This paper extends the tag reduction technique from a single-core processor to a multi-core processor and investigates the potential for energy saving on multi-core processors. We formulate our approach as an equivalent problem: find an assignment of all instruction pages in physical memory to a set of cores such that the tag-reduction conflicts for each core are mostly avoided or reduced. We then propose three algorithms using different heuristics for this assignment problem. We provide experimental results collected from a real operating system rather than from a processor simulator, the traditional approach, which cannot simulate operating system functions and the full memory hierarchy. Experimental results show that our proposed algorithms can save total energy of up to 83.93% on an 8-core processor and 76.16% on a 4-core processor on average, compared with a configuration that does not use tag reduction. They also significantly outperform the tag-reduction-based algorithm on a single-core processor.
Keywords: tag reduction; multi-core processor; energy efficiency
Schedule refinement for homogeneous multi-core processors in the presence of manufacturing-caused heterogeneity
9
Authors: Zhi-xiang CHEN, Zhao-lin LI, Shan CAO, Fang WANG, Jie ZHOU. 《Frontiers of Information Technology & Electronic Engineering》, SCIE EI CSCD, 2015, No. 12, pp. 1018-1033 (16 pages)
Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous downscaling of CMOS technology, within-die variations in the manufacturing process lead to a significant spread in the operating speeds of cores within homogeneous multi-core processors. Task scheduling approaches that do not consider such heterogeneity caused by within-die variations can lead to overly pessimistic results in terms of performance. To realize optimal performance according to the actual maximum clock frequencies at which cores can run, we present a heterogeneity-aware schedule refining (HASR) scheme that fully exploits the heterogeneities of homogeneous multi-core processors in embedded domains. We analyze and show how the actual maximum frequencies of cores are used to guide the scheduling. In the scheme, representative chip operating points are selected and the corresponding optimal schedules are generated as candidate schedules. During the booting of each chip, according to the actual maximum clock frequencies of its cores, one of the candidate schedules is bound to the chip to maximize performance. A set of applications is designed to evaluate the proposed scheme. Experimental results show that the proposed scheme improves performance by an average of 22.2% compared with the baseline schedule based on worst-case timing analysis. Compared with the conventional task scheduling approach based on the actual maximum clock frequencies, the proposed scheme also improves performance by up to 12%.
Keywords: schedule refining; multi-core processor; heterogeneity; representative chip operating point
Thread Private Variable Access Optimization Technique for Sunway High-Performance Multi-core Processors
10
Authors: Jinying Kong, Kai Nie, Qinglei Zhou, Jinlong Xu, Lin Han. 《国际计算机前沿大会会议论文集》 (Proceedings of the International Computer Frontier Conference), 2021, No. 1, pp. 180-189 (10 pages)
The primary way to achieve thread-level parallelism on the Sunway high-performance multi-core processor is to use the OpenMP programming technique. To address the problem of low parallel efficiency caused by slow access to thread-private variables in the compilation of Sunway OpenMP programs, this paper proposes a thread-private variable access technique based on privileged instructions. The technique centralizes the implementation of thread-private variables at the compiler level, eliminating the mode-switching overhead of invoking OS kernel processing and improving the speed of accessing thread-private variables. On the Sunway 1621 server platform, NPB3.3-OMP and SPEC OMP2012 achieved running efficiency gains of 6.2% and 6.8%, respectively. The results show that the technique proposed in this paper can provide technical support for fully exploiting the advantages of Sunway's high-performance multi-core processors.
Keywords: Sunway high-performance multi-core processors; OpenMP programming technique; privileged instruction-based thread-private variable access technique; Sunway 1621 processor
Parallel Region Reconstruction Technique for Sunway High-Performance Multi-core Processors
11
Authors: Kai Nie, Qinglei Zhou, Hong Qian, Jianmin Pang, Jinlong Xu, Yapeng Li. 《国际计算机前沿大会会议论文集》 (Proceedings of the International Computer Frontier Conference), 2021, No. 1, pp. 163-179 (17 pages)
The leading way to achieve thread-level parallelism on Sunway high-performance multi-core processors is to use OpenMP programming techniques. To address the low parallel efficiency caused by high thread group control overhead in the compilation of Sunway OpenMP programs, this paper proposes a parallel region reconstruction technique. The technique expands the scope of parallel regions in OpenMP programs through parallel region merging and parallel region extending. It also reduces the number of parallel regions in OpenMP programs, decreases the overhead of frequently creating and converging thread groups, and converts standard fork-join model OpenMP programs into higher-performance SPMD model OpenMP programs. On the Sunway 1621 server, NPB3.3-OMP and SPEC OMP2012 achieved running efficiency improvements of 8.9% and 7.9%, respectively, through parallel region reconstruction. The technique is therefore feasible and effective, and it provides technical support for fully exploiting the multi-core parallelism of Sunway's high-performance processors.
Keywords: Sunway high-performance multi-core processors; OpenMP programming technique; parallel domain reconstruction technique
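Clinical Effectiveness of a Domestic Medical Endoscope Image Processor and Xenon Cold Light Source (cited by: 7)
12
Authors: Shi Qiang (时强), Zhong Yunshi (钟芸诗), Gu Xiaozhou (顾小舟), Wang Ping (王萍), Lin Xinlin (林新林), Bao Hanjing (包寒晶), Kong Yun (孔云), Yao Liqing (姚礼庆). 《中国内镜杂志》 (China Journal of Endoscopy), PKU Core, 2015, No. 1, pp. 51-54 (4 pages)
Objective: To analyze the clinical effectiveness of the domestic Aohua (澳华) AQ-100 medical endoscope image processor and AQL-100 xenon cold light source. Methods: This clinical trial used a prospective, randomized, controlled design. One hundred patients who required electronic endoscopy and met the inclusion and exclusion criteria were randomly assigned to a trial group or a control group. The trial group was examined with the AQ-100 medical endoscope image processor and AQL-100 xenon cold light source made by Shanghai Aohua Photoelectricity Endoscope Co., Ltd. (上海澳华光电内镜有限公司); the control group was examined with the CV-260SL image processing unit and CLV-260SL endoscope cold light source made by OLYMPUS. The two endoscope systems were evaluated for substantial equivalence in main structure, performance, and other elements, and for equivalent safety and effectiveness. Results: The trial and control groups showed no substantial difference (P<0.05) in 12 clinical evaluation items: image sharpness, color customization, hemoglobin enhancement, image structure and contour enhancement, image processing functions such as freeze, playback, and storage, white balance, digital zoom, spectral staining, automatic endoscope recognition, character display, cold light source air pump performance, and failure rate. The two systems had the same clinical effectiveness and safety. Conclusion: The trial product can meet the requirements of clinical endoscopic diagnosis and treatment. The large gap between domestic endoscopes and imported equipment is steadily narrowing, making endoscopic diagnosis and treatment feasible in primary-level hospitals.
Keywords: endoscope; domestic; image processor; xenon cold light source; clinical trial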
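A Survey of the Development of Domestic Embedded Processors (cited by: 7)
13
Authors: Deng Bao (邓豹), Sun Jingguo (孙靖国). 《航空计算技术》 (Aeronautical Computing Technique), 2021, No. 1, pp. 120-124 (5 pages)
Processor technology directly drives the development of embedded computers. Starting from instruction set architecture and instruction word width, this paper describes the characteristics and typical representatives of the CISC and RISC classes of instruction set architecture. It describes the development plans and product lines of the main Chinese embedded processor vendors for the ARM, MIPS, and RISC-V architectures. In light of embedded application requirements, the main processors are introduced in detail, and typical embedded processors are selected for comparative testing. The instruction set architectures and product development of domestic embedded processors are briefly analyzed. The approach can guide the selection of domestic embedded processors and serve as a reference for embedded computer design.
Keywords: instruction set architecture; embedded processor; domestic processor; performance testing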
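Design and Implementation of a Home Mobile Monitoring System Based on ZigBee and Android (cited by: 4)
14
Authors: Zheng Xiaobin (郑晓彬), Wang Chenlan (王琛岚), Wang Zhong (王忠), Chen Heheng (陈和恒). 《计算机测量与控制》 (Computer Measurement & Control), 2015, No. 8, pp. 2706-2708, 2712 (4 pages)
To address the high cost, poor mobility, and difficult wiring of traditional home monitoring systems, a home mobile monitoring system based on ZigBee and Android is designed and implemented. The system uses a layered architecture. The first layer is the detection terminal, built around an STM32 microprocessor and comprising several physiological data acquisition nodes. The second layer is the home gateway, consisting of a ZigBee master node and a computer server; ZigBee wireless networking carries the communication between the server and the detection terminals. The third layer is the Android mobile platform, which accesses the server over an external network and provides a remote interaction platform. The system is applied to home health monitoring of the elderly and measures two important physiological indicators, body temperature and blood pressure. Experiments show that the system has good communication quality, can acquire, transmit, and display waveforms of physiological signals, can analyze blood pressure and body temperature values and raise high-risk alarms, and meets the requirements of home mobile monitoring.
Keywords: ZigBee; Android; STM32 processor; home mobile monitoring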
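Research on TDBI Technology for Domestic Multi-Core Processor Chips (cited by: 3)
15
Authors: Weng Lei (翁雷), Wang Zhengqi (望正气), Qu Fang (曲芳). 《环境技术》 (Environmental Technology), 2014, No. 4, pp. 66-69 (4 pages)
As integrated circuits grow more complex in function, dynamic burn-in of very-large-scale integrated (VLSI) circuits is becoming increasingly difficult. Burn-in of VLSI circuits has traditionally been static, which cannot exercise the functional nodes inside the circuit and therefore cannot guarantee burn-in effectiveness. As a result, TDBI (test during burn-in), which can apply fully dynamic stimulus to all functional modules inside the circuit, is attracting growing attention. Taking a domestic multi-core processor chip as the test object, this paper develops a hardware and software system for chip burn-in testing in order to study TDBI methods for multi-core processor chips, and it solves key technical problems in multi-core processor burn-in such as high-power supply and insufficient pattern storage space.
Keywords: domestic multi-core processor; test during burn-in; very-large-scale integrated circuit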
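Research on the Application of Domestic Computing Platforms in Command and Control Systems (cited by: 18)
16
Authors: Cheng Jian (程健), Wu Wei (吴蔚). 《自动化与信息工程》 (Automation & Information Engineering), 2011, No. 3, pp. 41-44 (4 pages)
This paper explains the concept and main components of a domestic computing platform and proposes how such a platform could be applied in command and control systems. Finally, it presents the design, implementation, and observed operating results of a compact multi-radar plot and track hybrid fusion processing device developed on a domestic computing platform, which validates the proposed application concept.
Keywords: domestic (localized); Loongson processor; Kylin operating system; command and control system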
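A Lightweight Performance Analysis Framework for Processor Cores
17
Authors: Lei Guoqing (雷国庆), Ma Chiyuan (马驰远), Wang Yongwen (王永文), Zheng Zhong (郑重). 《计算机工程与科学》 (Computer Engineering & Science), CSCD, PKU Core, 2021, No. 2, pp. 199-204 (6 pages)
Driven by the practical need to improve the core performance of domestic processors, and targeting performance defects that may arise in the RTL design of a processor core, a lightweight processor core performance analysis framework based on RTL simulation is proposed. Using directed and random test stimuli, the framework quickly simulates the RTL designs of a baseline core (Base Core) and a new-generation core (New Core) and compares the simulation results, thereby exposing performance defects that may have been introduced during the RTL design of the New Core. Test methods and test results based on the framework are given for practical application scenarios. Practice shows that the framework can quickly verify the expected performance of the New Core RTL design, uncover performance defects that may have been introduced during its RTL design, and effectively speed up the development of new-generation processor cores.
Keywords: domestic processor; processor core; performance analysis framework; RTL simulation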
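Characteristics of Degradation Temperature Behavior in a Household Waste Treatment Machine
18
Authors: Wu Hao (吴昊), Zong Zhimin (宗志敏), Zhang Gandao (张赣道). 《湖北农业科学》 (Hubei Agricultural Sciences), PKU Core, 2012, No. 23, pp. 5475-5478 (4 pages)
This paper studies how the degradation temperature changes as a household waste treatment machine biodegrades household waste. Experiments show that the temperature history of the system indicates effective biodegradation of household waste. Counting from the first batch of waste, the first 60 days of operation are the initial stage of the degradation reaction, during which the compound microorganisms in the inoculant are acclimated, revived, and multiply under suitable reaction conditions; the rate of temperature rise gradually increases, and the temperature climbs from 35 °C to 48 °C. Days 60 to 120 of normal operation are the high-rate stable stage of the degradation reaction, with the temperature held at about 50 °C.
Keywords: household waste; waste treatment machine; biodegradation; degradation temperature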
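An Incremental Real-Time Traffic Flow Prediction Algorithm and Its Implementation on Domestic Processors
19
Authors: Ji Yimu (季一木), Yang Qifan (杨启凡), Li Kui (李奎), You Shuai (尤帅), Shao Sisi (邵思思), Liu Qiang (刘强), Liu Shangdong (刘尚东). 《计算机应用研究》 (Application Research of Computers), CSCD, PKU Core, 2021, No. 5, pp. 1468-1471 (4 pages)
To address the difficulty of processing large volumes of urban traffic data and the poor real-time performance, an L-BFGS (limited-memory BFGS) algorithm running on domestic processors is proposed to predict congestion from incremental urban traffic flow data. The algorithm computes the Hessian matrix from a stored sequence of vectors and uses an improved Two-Loop algorithm to find the descent direction. It converges quickly when run in parallel on a Spark cluster and suits urban traffic scenarios with strict real-time requirements. Experimental results show that the L-BFGS prediction algorithm can rapidly model and predict large-scale real-time traffic data streams on a domestic platform, providing effective support for better urban traffic management while broadening the application areas of domestic chips.
Keywords: domestic processor; incremental urban traffic flow data; Spark cluster; L-BFGS algorithm; traffic flow prediction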
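As background for the Two-Loop step named in this abstract, here is a minimal sketch of the standard textbook L-BFGS two-loop recursion. The abstract does not describe the authors' improved variant or their Spark parallelization, so neither is shown. The names `two_loop_direction`, `s_list`, and `y_list` are illustrative assumptions, and each stored pair is assumed to satisfy the curvature condition `dot(s, y) > 0`.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def two_loop_direction(grad, s_list, y_list):
    """Return the L-BFGS search direction -H @ grad without forming H.

    s_list[i] is a step x_{i+1} - x_i and y_list[i] the matching gradient
    change g_{i+1} - g_i, ordered oldest first (assumed convention).
    """
    q = list(grad)
    rhos = [1.0 / dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = [0.0] * len(s_list)
    # First loop: walk the stored pairs from newest to oldest.
    for i in reversed(range(len(s_list))):
        alphas[i] = rhos[i] * dot(s_list[i], q)
        q = [qj - alphas[i] * yj for qj, yj in zip(q, y_list[i])]
    # Scale by the usual initial inverse-Hessian guess gamma * I.
    if s_list:
        gamma = dot(s_list[-1], y_list[-1]) / dot(y_list[-1], y_list[-1])
    else:
        gamma = 1.0
    r = [gamma * qj for qj in q]
    # Second loop: walk the stored pairs from oldest to newest.
    for i in range(len(s_list)):
        beta = rhos[i] * dot(y_list[i], r)
        r = [rj + (alphas[i] - beta) * sj for rj, sj in zip(r, s_list[i])]
    return [-rj for rj in r]
```

With an empty history the function reduces to steepest descent. With stored pairs it needs only a few vector operations per pair, which is why the method fits settings where a full Hessian would be too large to store.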
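Shared-Memory Access Optimization of OpenMP Offload for Domestic Heterogeneous Platforms
20
Authors: Wang Xin (王鑫), Li Jianan (李嘉楠), Han Lin (韩林), Zhao Rongcai (赵荣彩), Zhou Qiangwei (周强伟). 《计算机工程与应用》 (Computer Engineering and Applications), CSCD, PKU Core, 2023, No. 10, pp. 75-85 (11 pages)
The local data share (LDS) on the domestic heterogeneous processor DCU (deep computing unit) is a low-latency, high-bandwidth, explicitly addressed memory. OpenMP on the domestic heterogeneous system provides no programming interface for accessing the LDS, so the LDS hardware is not used effectively for efficient data access. To address this, the paper studies the OpenMP Offload execution modes on the DCU platform, the LDS allocation method, and the instruction structure specific to LDS access, and implements manual support for LDS access. In addition, for the different execution modes of OpenMP Offload, LDS access is automated on top of this optimization, forming an efficient memory access strategy for domestic heterogeneous platforms. Experiments on the polybench benchmark suite show an average speedup of up to 2.60 in single-thread mode with both the manual and automatic optimizations, 1.38 in multi-thread non-SPMD mode with the manual optimization, and 1.11 in multi-thread SPMD mode with the automatic optimization. The results indicate that automatic and manual support for LDS access helps speed up heterogeneous OpenMP programs.
Keywords: domestic processor DCU; local data share (LDS); OpenMP Offload; SPMD; non-SPMD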