期刊文献+
共找到163篇文章
< 1 2 9 >
每页显示 20 50 100
Multi-core optimization for conjugate gradient benchmark on heterogeneous processors
1
作者 邓林 窦勇 《Journal of Central South University》 SCIE EI CAS 2011年第2期490-498,共9页
Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at t... Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at this problem,a parallelization approach was proposed with six memory optimization schemes for CG,four schemes of them aiming at all kinds of sparse matrix-vector multiplication (SPMV) operation. Conducted on IBM QS20,the parallelization approach can reach up to 21 and 133 times speedups with size A and B,respectively,compared with single power processor element. Finally,the conclusion is drawn that the peak bandwidth of memory access on Cell BE can be obtained in SPMV,simple computation is more efficient on heterogeneous processors and loop-unrolling can hide local storage access latency while executing scalar operation on SIMD cores. 展开更多
关键词 multi-core processor NAS parallelization CG memory optimization
下载PDF
Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on Multi-Core Processor
2
作者 Zhang Ziran,Li Jun,Li Changxiao(ZTE Corporation,Shenzhen 518057,P.R.China) 《ZTE Communications》 2009年第1期54-58,共5页
The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Co... The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Consequently,the single-core processor cannot meet the requirements of LTE system.This paper analyzes how to use multi-core processors to achieve parallel processing of uplink demodulation and decoding in LTE systems and designs an approach to parallel processing.The test results prove that this approach works quite well. 展开更多
关键词 CORE LTE Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on multi-core processor Design
下载PDF
Optimization Task Scheduling Using Cooperation Search Algorithm for Heterogeneous Cloud Computing Systems 被引量:1
3
作者 Ahmed Y.Hamed M.Kh.Elnahary +1 位作者 Faisal S.Alsubaei Hamdy H.El-Sayed 《Computers, Materials & Continua》 SCIE EI 2023年第1期2133-2148,共16页
Cloud computing has taken over the high-performance distributed computing area,and it currently provides on-demand services and resource polling over the web.As a result of constantly changing user service demand,the ... Cloud computing has taken over the high-performance distributed computing area,and it currently provides on-demand services and resource polling over the web.As a result of constantly changing user service demand,the task scheduling problem has emerged as a critical analytical topic in cloud computing.The primary goal of scheduling tasks is to distribute tasks to available processors to construct the shortest possible schedule without breaching precedence restrictions.Assignments and schedules of tasks substantially influence system operation in a heterogeneous multiprocessor system.The diverse processes inside the heuristic-based task scheduling method will result in varying makespan in the heterogeneous computing system.As a result,an intelligent scheduling algorithm should efficiently determine the priority of every subtask based on the resources necessary to lower the makespan.This research introduced a novel efficient scheduling task method in cloud computing systems based on the cooperation search algorithm to tackle an essential task and schedule a heterogeneous cloud computing problem.The basic idea of thismethod is to use the advantages of meta-heuristic algorithms to get the optimal solution.We assess our algorithm’s performance by running it through three scenarios with varying numbers of tasks.The findings demonstrate that the suggested technique beats existingmethods NewGenetic Algorithm(NGA),Genetic Algorithm(GA),Whale Optimization Algorithm(WOA),Gravitational Search Algorithm(GSA),and Hybrid Heuristic and Genetic(HHG)by 7.9%,2.1%,8.8%,7.7%,3.4%respectively according to makespan. 展开更多
关键词 heterogeneous processors cooperation search algorithm task scheduling cloud computing
下载PDF
Shared Cache Based on Content Addressable Memory in a Multi-Core Architecture
4
作者 Allam Abumwais Mahmoud Obaid 《Computers, Materials & Continua》 SCIE EI 2023年第3期4951-4963,共13页
Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to acc... Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to access the shared cache simultaneously.The main problem in improving memory performance is the shared cache architecture and cache replacement.This paper documents the implementation of a Dual-Port Content Addressable Memory(DPCAM)and a modified Near-Far Access Replacement Algorithm(NFRA),which was previously proposed as a shared L2 cache layer in a multi-core processor.Standard Performance Evaluation Corporation(SPEC)Central Processing Unit(CPU)2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer.Results show improved performance of the multicore processor’s DPCAM and NFRA algorithms,corresponding to a higher number of concurrent accesses to shared memory.The new architecture significantly increases system throughput and records performance improvements of up to 8.7%on various types of SPEC 2006 benchmarks.The miss rate is also improved by about 13%,with some exceptions in the sphinx3 and bzip2 benchmarks.These results could open a new window for solving the long-standing problems with shared cache in multi-core processors. 展开更多
关键词 multi-core processor shared cache content addressable memory dual port CAM replacement algorithm benchmark program
下载PDF
Schedule refinement for homogeneous multi-core processors in the presence of manufacturing-caused heterogeneity
5
作者 Zhi-xiang CHEN Zhao-lin LI +2 位作者 Shan CAO Fang WANG Jie ZHOU 《Frontiers of Information Technology & Electronic Engineering》 SCIE EI CSCD 2015年第12期1018-1033,共16页
Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous down scaling of CMOS technology, within-die variations in the manufacturin... Multi-core homogeneous processors have been widely used to deal with computation-intensive embedded applications. However, with the continuous down scaling of CMOS technology, within-die variations in the manufacturing process lead to a significant spread in the operating speeds of cores within homogeneous multi-core processors. Task scheduling approaches, which do not consider such heterogeneity caused by within-die variations,can lead to an overly pessimistic result in terms of performance. To realize an optimal performance according to the actual maximum clock frequencies at which cores can run, we present a heterogeneity-aware schedule refining(HASR) scheme by fully exploiting the heterogeneities of homogeneous multi-core processors in embedded domains.We analyze and show how the actual maximum frequencies of cores are used to guide the scheduling. In the scheme,representative chip operating points are selected and the corresponding optimal schedules are generated as candidate schedules. During the booting of each chip, according to the actual maximum clock frequencies of cores, one of the candidate schedules is bound to the chip to maximize the performance. A set of applications are designed to evaluate the proposed scheme. Experimental results show that the proposed scheme can improve the performance by an average value of 22.2%, compared with the baseline schedule based on the worst case timing analysis. Compared with the conventional task scheduling approach based on the actual maximum clock frequencies, the proposed scheme also improves the performance by up to 12%. 展开更多
关键词 Schedule refining multi-core processor heterogenEITY Representative chip operating point
原文传递
YHFT-QDSP:High-Performance Heterogeneous Multi-Core DSP
6
作者 陈书明 万江华 +8 位作者 鲁建壮 刘仲 孙海燕 孙永节 刘衡竹 刘祥远 李振涛 徐毅 陈小文 《Journal of Computer Science & Technology》 SCIE EI CSCD 2010年第2期214-224,共11页
Multi-core architectures are widely used to in time-to-market and power consumption of the chips enhance the microprocessor performance within a limited increase Toward the application of high-density data signal pro... Multi-core architectures are widely used to in time-to-market and power consumption of the chips enhance the microprocessor performance within a limited increase Toward the application of high-density data signal processing, this paper presents a novel heterogeneous multi-core architecture digital signal processor (DSP), YHFT-QDSP, with one RISC CPU core and 4 VLIW DSP cores. By three kinds of interconnection, YHFT-QDSP provides high efficiency message communication for inner-chip RISC core and DSP cores, inner-chip and inter-chip DSP cores. A parallel programming platform is specifically developed for the heterogeneous nmlti-core architecture of YHFT-QDSP. This parallel programming environment provides a parallel support library and a friendly interface between high level application softwares and multi- core DSP. The 130 nm CMOS custom chip design results benchmarks show that the interconnection structure of in a high speed and moderate power design. The results of typical YHFT-QDSP is much better than other related structures and achieves better speedup when using the interconnection facilities in combing methods. YHFT-QDSP has been signed off and manufactured presently. The future applications of the multi-core chip could be found in 3G wireless base station, high performance radar, industrial applications, and so on. 展开更多
关键词 digital signal processor (DSP) multi-core ARCHITECTURE parallel programming custom design
原文传递
基于Amdahl定律的异构多核密码处理器能效模型研究
7
作者 李伟 郎俊豪 +1 位作者 陈韬 南龙梅 《电子学报》 EI CAS CSCD 北大核心 2024年第3期849-862,共14页
边缘计算安全的资源受限特征及各种新型密码技术的应用,对多核密码处理器的高能效、异构性提出需求,但当前尚缺乏相关的异构多核能效模型研究.本文基于扩展Amdahl定律,引入密码串并特征、异构多核结构、数据准备时间、动态电压频率调节... 边缘计算安全的资源受限特征及各种新型密码技术的应用,对多核密码处理器的高能效、异构性提出需求,但当前尚缺乏相关的异构多核能效模型研究.本文基于扩展Amdahl定律,引入密码串并特征、异构多核结构、数据准备时间、动态电压频率调节等因素,将核划分空闲、活跃状态,建立异构多核密码处理器的能效模型.MATLAB仿真结果表明,数据准备时间占比小于10%时,对能效的负面影响大幅下降;固定电压,频率缩放会影响能效值大小;处理器核空闲/活跃能耗比例越小,能效值越大.架构上,固定异构核,同构核数量与密码任务最大并行度相等时能效值最大,最佳异构核数可由模型变化参数仿真得到;多任务调度执行上,流水与并发执行有利于能效值的进一步提升.多核密码处理器芯片板级测试结果表明,仿真结果与实测数据相关系数接近1,芯片实测的数据准备时间、电压频率缩放等因素的影响与仿真分析基本一致,验证了所提能效模型的有效性.该文重点从影响能效变化趋势因素上,为多核密码处理器异构、高能效设计提供一定的理论分析基础与建议. 展开更多
关键词 密码处理器 多核处理器 异构 AMDAHL定律 能效模型
下载PDF
基于TDA4VM的疲劳状态实时检测系统设计
8
作者 付丽 滕召波 +2 位作者 张一帆 罗钧 王浩程 《实验室研究与探索》 CAS 北大核心 2024年第11期26-30,38,共6页
针对传统嵌入式平台疲劳状态检测系统识别精度低和实时性差的问题,设计了一种基于TDA4VM异构多核处理器的疲劳状态实时检测系统。TDA4VM嵌入式处理器通过摄像头获取图像并进行目标检测,STM32微控制器控制外设模块,包括GPS模块、GSM模块... 针对传统嵌入式平台疲劳状态检测系统识别精度低和实时性差的问题,设计了一种基于TDA4VM异构多核处理器的疲劳状态实时检测系统。TDA4VM嵌入式处理器通过摄像头获取图像并进行目标检测,STM32微控制器控制外设模块,包括GPS模块、GSM模块和语音模块。在目标检测算法方面,先在YOLOX目标检测算法中引入注意力机制模块CBAM(Convolutional Block Attention Module),再对激活函数进行改进,并优化小滑窗替换算法。将训练后的YOLOX模型部署在硬件平台上,实际车载实验结果表明,在不同环境下疲劳状态检测精度可达到95.3%,同时还实现了30帧/s的实时检测。该检测系统具备精度高、实时性强和教学简易等特点,在实验教学和工程应用方面具有一定的参考价值。 展开更多
关键词 疲劳检测 深度学习 异构多核 处理器 YOLOX算法
下载PDF
一种基于异构处理器的可动态布署设计与实现
9
作者 钱宏文 陈光威 《电子技术应用》 2024年第1期93-100,共8页
针对卫星支持的多种生活服务需求实时切换、资源灵活智能调用需求,基于无线广域信号服务异构处理器,设计了一种即时高效、动态切换部署处理器功能的方案。通过对大资源FPGA及多片8核DSP多种功能定制结合动态部署设计,实现实时动态可重... 针对卫星支持的多种生活服务需求实时切换、资源灵活智能调用需求,基于无线广域信号服务异构处理器,设计了一种即时高效、动态切换部署处理器功能的方案。通过对大资源FPGA及多片8核DSP多种功能定制结合动态部署设计,实现实时动态可重构处理器系统功能,将5种FPGA应用结合2种DSP应用程序动态组合,配合各功能任务架构需求重建控制、数据链路,完成多任务智能切换。 展开更多
关键词 异构处理器 动态部署 可重构 FPGA DSP
下载PDF
基于疯狂自适应樽海鞘群优化算法的异构多核任务调度
10
作者 程小辉 刘天承 《计算机与数字工程》 2024年第10期2886-2889,2919,共5页
为了解决当前异构多核环境下的任务调度效率不能满足应用程序的多样性要求的问题,论文基于疯狂自适应的樽海鞘群优化算法(Crazy and Adaptive Salp Swarm Algorithm,CASSA),提出一种异构多核处理器任务调度算法。该算法以缩短全部任务... 为了解决当前异构多核环境下的任务调度效率不能满足应用程序的多样性要求的问题,论文基于疯狂自适应的樽海鞘群优化算法(Crazy and Adaptive Salp Swarm Algorithm,CASSA),提出一种异构多核处理器任务调度算法。该算法以缩短全部任务的完成时间为目标,根据任务优先权规则设计任务分配的编码方案,利用CASSA算法中领导者的全局搜索能力和追随者的局部搜索能力,使CASSA算法在异构多核任务调度问题上有更高的收敛效率和更高质量的解。实验表明,CASSA算法的性能优良,最优解的质量高,在异构多核处理器任务调度领域中具有良好的研究意义。 展开更多
关键词 异构多核处理器 任务调度 疯狂自适应的樽海鞘群优化算法
下载PDF
Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture 被引量:13
11
作者 郑方 李宏亮 +3 位作者 吕晖 过锋 许晓红 谢向辉 《Journal of Computer Science & Technology》 SCIE EI CSCD 2015年第1期145-162,共18页
Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which h... Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing ele- ments (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and share-memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, efficient register-level communication mechanism, and fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS. 展开更多
关键词 heterogeneous many-core processor data stream transfer register-level communication mechanism hardwaresynchronization technique processor prototype
原文传递
面向申威众核处理器的规则处理优化技术
12
作者 张振东 王彤 刘鹏 《计算机研究与发展》 EI CSCD 北大核心 2024年第1期66-85,共20页
高性能口令恢复系统是申威众核处理器的重要应用场景之一,规则处理是主流口令恢复工具中被广泛应用的一种口令生成方式.现有相关研究工作缺少对规则处理算法的优化,导致申威处理器上基于规则的口令生成速度成为口令恢复系统的性能瓶颈.... 高性能口令恢复系统是申威众核处理器的重要应用场景之一,规则处理是主流口令恢复工具中被广泛应用的一种口令生成方式.现有相关研究工作缺少对规则处理算法的优化,导致申威处理器上基于规则的口令生成速度成为口令恢复系统的性能瓶颈.通过分析规则处理算法的多层次可并行性,提出了面向申威众核处理器的线程级、数据级优化方案.在线程级优化方案中,探索了规则处理算法的最优任务映射方式,设计了主从核任务分配机制、从核缓冲区配比优化机制、负载均衡机制、变长规则存储机制等技术以提高并行效率;在数据级优化方案中,分析了规则处理算法中规则函数的计算模式,并通过申威SIMD指令集对规则函数进行向量优化以提高执行效率.在SW26010处理器上的实验结果表明,上述优化方案有效解除了规则处理的性能瓶颈,使规则模式下的口令恢复速度提升了30~101倍. 展开更多
关键词 申威众核处理器 口令恢复 规则处理 异构计算 单指令多数据流
下载PDF
基于轻量级的RISC-V异构处理器的安全模型研究
13
作者 罗云鹏 吴晋成 +1 位作者 王正 王铜柱 《通信技术》 2024年第9期973-980,共8页
面对物联网的快速发展,需要低延时、高性能的处理器来实现关键数据的传输和保护,同时要提高处理器的硬件安全,减少非法用户对处理器的攻击。结合当前开源第五代精简指令集(Reduced Instruction Set Computing-Five,RISC-V)处理器架构优... 面对物联网的快速发展,需要低延时、高性能的处理器来实现关键数据的传输和保护,同时要提高处理器的硬件安全,减少非法用户对处理器的攻击。结合当前开源第五代精简指令集(Reduced Instruction Set Computing-Five,RISC-V)处理器架构优点,与现场可编程门阵列(Field Programmable Gate Array,FPGA)相结合,设计了异构处理器,提出了基于密码的安全启动模型。首先,细化RISC-V异构处理器的体系结构,设计轻量级密码启动安全模型TrustZone,实现处理器性能与安全的平衡,并结合FPGA的优点,实现定制化的专用协议与业务通信。其次,提出当前RISC-V异构处理器可实现的便捷途径,并基于此进行模型搭建和测试验证。验证结果表明,虽然采用TrustZone安全度量后处理器启动时间有所增加,但针对轻量级的处理器应用场景,在增强处理器安全的前提下,该启动时间开销是可以接受的。 展开更多
关键词 RISC-V 异构处理器 可信启动 密码协处理 TrustZone认证
下载PDF
适用于S-NUCA异构处理器的任务调度与热管理系统
14
作者 周义涛 李阳 +3 位作者 韩超 赵玉来 汪玲 李建华 《计算机工程》 CAS CSCD 北大核心 2024年第2期196-205,共10页
异构多核处理器凭借其高性能、低功耗和广泛的应用场景而成为当前计算机平台的主流方案,且大容量的非均匀缓存架构(S-NUCA)具有较低的平均访问时间。然而,不断上升的晶体管规模给异构多核处理器的资源调度和功耗控制带来挑战,传统的调... 异构多核处理器凭借其高性能、低功耗和广泛的应用场景而成为当前计算机平台的主流方案,且大容量的非均匀缓存架构(S-NUCA)具有较低的平均访问时间。然而,不断上升的晶体管规模给异构多核处理器的资源调度和功耗控制带来挑战,传统的调度算法在面对基于S-NUCA的多核处理器时忽略了核心之间的缓存访问延迟,且传统热管理方案只提供芯片级功率约束,容易使得系统因核心使用率降低而造成性能下降。为此,提出一种适用于S-NUCA异构多核系统、满足热安全约束的动态线程调度机制TSCDM。利用基于动态每周期指令(IPC)值的阶段检测技术,并基于人工神经网络预测线程的IPC值,以获取线程与核心类型的最佳绑定关系,依据S-NUCA缓存特性获得最优映射和基于任务分类的任务迁移策略。在此基础上,TSCDM基于片上热模型为每个核心实时分配功率预算。在HotSniper上运行SPLASH-2性能测试套件进行实验,结果表明,相较于传统调度方案与基于机器学习的调度方案,TSCDM在加速比和资源利用率上均表现出优势,TSCDM中使用的基于瞬态温度的安全功率算法相比传统热安全功率算法能够降低核心热余量,同时处理器的全频段均有更高的能效比。 展开更多
关键词 异构多核处理器 人工神经网络 线程调度 阶段检测 热安全功率
下载PDF
System Architecture of Godson-3 Multi-Core Processors 被引量:7
15
作者 高翔 陈云霁 +2 位作者 王焕东 唐丹 胡伟武 《Journal of Computer Science & Technology》 SCIE EI CSCD 2010年第2期181-191,共11页
Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This pa... Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from various aspects including system scalability, organization of memory hierarchy, network-on-chip, inter-chip connection and I/O subsystem. 展开更多
关键词 multi-core processor scalable interconnection cache coherent non-uniform memory access/non-uniform cache access (CC-NUMA/NUCA) MESH CROSSBAR cache coherence reliability availability and serviceability (RAS)
原文传递
基于GSLF-SSA的异构多核处理器任务调度
16
作者 刘齐坚 王韦刚 高鹏程 《计算机技术与发展》 2024年第7期48-54,共7页
为了提高异构多核处理器平台的计算性能,从任务调度的角度出发,提出了一种使用黄金正弦和莱维飞行机制改进的麻雀搜索算法(Fusion of Golden Sinusoidal and Levy Flight in Sparrow Search Algorithm,GSLF-SSA)来优化异构多核处理器的... 为了提高异构多核处理器平台的计算性能,从任务调度的角度出发,提出了一种使用黄金正弦和莱维飞行机制改进的麻雀搜索算法(Fusion of Golden Sinusoidal and Levy Flight in Sparrow Search Algorithm,GSLF-SSA)来优化异构多核处理器的任务调度。通过对异构任务调度的分析,将异构任务建模为DAG(Directed Acyclic Graph)任务模型,通过对其优先级进行随机编码分配,实现了GSLF-SSA算法求解域从连续到离散的映射,使该算法更能适用于异构多核任务调度之中。将DAG任务的最优调度长度作为算法的适应度值进行迭代寻优,通过与目前应用广泛的麻雀搜索算法(SSA)、混合式任务调度算法(IHSSA)、人工蜂群算法(ABC)等多种启发式算法在异构任务调度环境下的实验对比表明,GSLF-SSA能获得更优的调度长度与更短的调度执行时间。 展开更多
关键词 异构多核处理器 麻雀搜索算法 有向无环图 任务调度 黄金正弦 莱维飞行
下载PDF
Parallel computing of discrete element method on multi-core processors 被引量:6
17
作者 Yusuke Shigeto Mikio Sakai 《Particuology》 SCIE EI CAS CSCD 2011年第4期398-405,共8页
This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention in accelerating computer ... This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention in accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, DEM algorithm is shown to have high scalabilitv in a multi-thread parallel computation on a CPU. 展开更多
关键词 Discrete element method Parallel computing multi-core processor GPGPU
原文传递
Energy Efficiency of a Multi-Core Processor by Tag Reduction
18
作者 郑龙 董冕雄 +3 位作者 Kaoru Ota 金海 Song Guo 马俊 《Journal of Computer Science & Technology》 SCIE EI CSCD 2011年第3期491-503,共13页
We consider the energy saving problem for caches on a multi-core processor. In the previous research on low power processors, there are various methods to reduce power dissipation. Tag reduction is one of them. This p... We consider the energy saving problem for caches on a multi-core processor. In the previous research on low power processors, there are various methods to reduce power dissipation. Tag reduction is one of them. This paper extends the tag reduction technique on a single-core processor to a multi-core processor and investigates the potential of energy saving for multi-core processors. We formulate our approach as an equivalent problem which is to find an assignment of the whole instruction pages in the physical memory to a set of cores such that the tag-reduction conflicts for each core can be mostly avoided or reduced. We then propose three algorithms using different heuristics for this assignment problem. We provide convincing experimental results by collecting experimental data from a real operating system instead of the traditional way using a processor simulator that cannot simulate operating system functions and the full memory hierarchy. Experimental results show that our proposed algorithms can save total energy up to 83.93% on an 8-core processor and 76.16% on a 4-core processor in average compared to the one that the tag-reduction is not used for. They also significantly outperform the tag reduction based algorithm on a single-core processor. 展开更多
关键词 tag reduction multi-core processor energy efficiency
原文传递
Thread Private Variable Access Optimization Technique for Sunway High-Performance Multi-core Processors
19
作者 Jinying Kong Kai Nie +2 位作者 Qinglei Zhou Jinlong Xu Lin Han 《国际计算机前沿大会会议论文集》 2021年第1期180-189,共10页
The primary way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processor is to use the OpenMP programming technique.To address the problem of low parallelism efficiency caused by slow acce... The primary way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processor is to use the OpenMP programming technique.To address the problem of low parallelism efficiency caused by slow accessto thread private variables in the compilation of Sunway OpenMP programs, thispaper proposes a thread private variable access technique based on privilegedinstructions. The privileged instruction-based thread-private variable access techniquecentralizes the implementation of thread-private variables at the compilerlevel, eliminating the model switching overhead of invoking OS core processingand improving the speed of accessing thread-private variables. On the Sunway1621 server platform, NPB3.3-OMP and SPEC OMP2012 achieved 6.2% and6.8% running efficiency gains, respectively. The results show that the techniquesproposed in this paper can provide technical support for giving full play to theadvantages of Sunway’s high-performance multi-core processors. 展开更多
关键词 Sunway high-performance multi-core processors OpenMP programming technique Privileged instruction-based thread-private variable access technique Sunway 1621 processor
原文传递
Parallel Region Reconstruction Technique for Sunway High-Performance Multi-core Processors
20
作者 Kai Nie Qinglei Zhou +3 位作者 Hong Qian Jianmin Pang Jinlong Xu Yapeng Li 《国际计算机前沿大会会议论文集》 2021年第1期163-179,共17页
The leading way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processors is to use OpenMP programming techniques.In order to address the problem of low parallel efficiency caused by hight... The leading way to achieve thread-level parallelism on the Sunwayhigh-performance multicore processors is to use OpenMP programming techniques.In order to address the problem of low parallel efficiency caused by highthread group control overhead in the compilation of Sunway OpenMP programs,this paper proposes the parallel region reconstruction technique. The parallelregion reconstruction technique expands the parallel scope of parallel regionsin OpenMP programs by parallel region merging and parallel region extending.Moreover, it reduces the number of parallel regions in OpenMP programs,decreases the overhead of frequent creation and convergence of thread groups,and converts standard fork-join model OpenMP programs to higher performanceSPMD modelOpenMP programs. On the Sunway 1621 server computer, NPB3.3-OMP and SPEC OMP2012 achieved 8.9% and 7.9% running efficiency improvementrespectively through parallel region reconstruction technique. As a result,the parallel region reconstruction technique is feasible and effective. It providestechnical support to fully exploit the multi-core parallelism advantage of Sunway’shigh-performance processors. 展开更多
关键词 Sunway high-performance multi-core processors OpenMP programming technique Parallel domain reconstruction technique
原文传递
上一页 1 2 9 下一页 到第
使用帮助 返回顶部