Journal Articles
36 articles found
Shared Cache Based on Content Addressable Memory in a Multi-Core Architecture
1
Authors: Allam Abumwais, Mahmoud Obaid. Computers, Materials & Continua (SCIE, EI), 2023, no. 3, pp. 4951-4963 (13 pp.).
Modern shared-memory multi-core processors typically have shared Level 2 (L2) or Level 3 (L3) caches. Cache bottlenecks and replacement strategies are the main problems of such architectures, where multiple cores try to access the shared cache simultaneously. The main obstacles to improving memory performance are the shared cache architecture and cache replacement. This paper documents the implementation of a Dual-Port Content Addressable Memory (DPCAM) and a modified Near-Far Access Replacement Algorithm (NFRA), previously proposed as a shared L2 cache layer in a multi-core processor. Standard Performance Evaluation Corporation (SPEC) Central Processing Unit (CPU) 2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer. Results show that DPCAM and NFRA improve the multi-core processor's performance, corresponding to a higher number of concurrent accesses to shared memory. The new architecture significantly increases system throughput, with performance improvements of up to 8.7% on various SPEC 2006 benchmarks. The miss rate also improves by about 13%, with some exceptions in the sphinx3 and bzip2 benchmarks. These results could open new avenues for solving long-standing shared-cache problems in multi-core processors.
Keywords: multi-core processor; shared cache; content addressable memory; dual-port CAM; replacement algorithm; benchmark program
Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on Multi-Core Processor
2
Authors: Zhang Ziran, Li Jun, Li Changxiao (ZTE Corporation, Shenzhen 518057, P.R. China). ZTE Communications, 2009, no. 1, pp. 54-58 (5 pp.).
The Long Term Evolution (LTE) system imposes strict requirements on dispatching delay. Moreover, the very high air-interface rate of LTE requires the devices that process baseband signals to have strong processing capability, so a single-core processor cannot meet the requirements of an LTE system. This paper analyzes how to use multi-core processors to parallelize uplink demodulation and decoding in LTE systems and designs a parallel processing approach. Test results show that the approach works well.
Keywords: CORE; LTE; Parallel Processing Design for LTE PUSCH Demodulation and Decoding Based on multi-core processor; Design
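Research and Design of a Multi-Core Stack Processor
3
Authors: 刘自昂, 周永录, 代红兵, 刘宏杰. 《计算机工程与设计》 (Computer Engineering and Design; PKU Core), 2024, no. 4, pp. 1256-1263 (8 pp.).
To meet the demands that increasingly complex embedded environments place on stack processors and Forth technology, a multi-core stack processor model is designed on the basis of research into single-core stack processor models. Starting from the J1 single-core stack processor model and targeting multi-core operation, timer and interrupt functions are added to form a new single-core stack processor model, L32. With this single-core model as the core, the design introduces a Wishbone bus with shared-bus and crossbar-switch interconnects, multi-port memory, and an instruction set oriented toward a multitasking Forth system, yielding a multi-core stack processor model, L32-MC. Using this model, 4-core and 8-core L32-MC prototype multi-core stack processors are implemented on an FPGA. Experimental results show that the 4-core and 8-core L32-MC prototypes meet the design goals of a high-performance, low-power multi-core processor.
Keywords: multi-core stack processor; Forth technology; Wishbone on-chip bus; multi-port memory; instruction set; field-programmable gate array (FPGA); embedded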
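Design of an Indigenous Software Stack for the 飞腾迈创 (FT-Matrix) DSP
4
Authors: 时洋, 陈照云, 孙海燕, 王耀华, 文梅, 扈啸. 《计算机工程与科学》 (Computer Engineering & Science; CSCD, PKU Core), 2024, no. 6, pp. 968-976 (9 pp.).
The FT-Matrix DSP is a high-performance digital signal processor independently designed by the College of Computer at the National University of Defense Technology to overcome "chokepoint" technologies and end the long-standing dependence of key domestic fields on foreign chips. Because this chip series uses a fully self-designed instruction set and cannot run existing software, a complete, efficient, indigenous software stack is key to the viability of the FT-Matrix DSP. Drawing on the team's long-term work, this paper systematically describes the design principles and layered architecture of the FT-Matrix DSP software stack, focusing on the innovative features, implementation methods, and performance of its software tools across the support layer, compilation layer, and tool layer. Drawing on user feedback and the team's own reflections, it also discusses open problems that the software stack must explore in the future.
Keywords: DSP; software stack; compiler; debugger; indigenous chip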
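Design and Implementation of a High-Speed Network Data Processing Framework on the Sunway Platform
5
Authors: 曹建军, 佘平, 聂世强. 《计算机技术与发展》 (Computer Technology and Development), 2024, no. 7, pp. 184-191 (8 pp.).
With the surge in network traffic in the big-data era, the traditional in-kernel network protocol stack spends a high proportion of its time on kernel-switching overhead, among other costs, so existing kernel-based network data processing systems cannot fully exploit the transmit and receive capacity of 10 Gb and even 100 Gb high-speed NICs. To reduce kernel-switching overhead, the open-source DPDK user-space network development kit was introduced to support high-speed traffic processing, and it has been deployed at scale on x86 platforms. To meet the requirements of domestic IT innovation (信创) and network security, this paper designs and implements a DPDK-based framework for network traffic reassembly and parsing on the domestic Sunway processor platform. The framework makes full use of DPDK mechanisms such as huge-page memory and lock-free queues, uses multithreaded parallelism to exploit the multi-core performance of Sunway processors, supports parsing of many common TCP/UDP-based application-layer protocols, and is lightweight and extensible. Experiments on real hardware show that the framework outperforms existing mainstream software by about 10%, offering a preliminary exploration of high-speed network data processing on domestic processor platforms.
Keywords: DPDK; protocol analysis; high-speed network; TCP/IP protocol stack; domestic processor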
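Research and Implementation of a Fast Stack Processor Architecture with Low Switching Overhead
6
Authors: 郭金辉, 代红兵, 周永录, 刘宏杰. 《计算机工程与设计》 (Computer Engineering and Design; PKU Core), 2023, no. 1, pp. 292-298 (7 pp.).
Current Forth stack processor architectures do not support concurrent multitasking or real-time event response. To address this, a fast Forth stack processor architecture with low switching overhead is proposed. Building on an existing Forth stack processor architecture, it adds new instructions, a timer, an interrupt mechanism, and multitasking stack technology so that the architecture supports real-time multitasking. Experimental results show that Forth real-time multitask scheduling on this stack processor architecture supports multitask execution and, compared with current Forth real-time multitask scheduling on register-based processors, markedly shortens real-time task response time, task context-switch time, and maximum interrupt-disabled time.
Keywords: Forth stack processor; new instructions; timer; multitasking stack technology; real-time multitasking; register-based processor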
System Architecture of Godson-3 Multi-Core Processors (cited 7 times)
7
Authors: 高翔, 陈云霁, 王焕东, 唐丹, 胡伟武. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2010, no. 2, pp. 181-191 (11 pp.).
Godson-3 is the latest generation of the Godson microprocessor family. It adopts a scalable multi-core architecture with hardware support for accelerating applications, including x86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from several aspects, including system scalability, memory hierarchy organization, network-on-chip, inter-chip connection, and the I/O subsystem.
Keywords: multi-core processor; scalable interconnection; cache coherent non-uniform memory access/non-uniform cache access (CC-NUMA/NUCA); mesh; crossbar; cache coherence; reliability, availability and serviceability (RAS)
Parallel computing of discrete element method on multi-core processors (cited 6 times)
8
Authors: Yusuke Shigeto, Mikio Sakai. Particuology (SCIE, EI, CAS, CSCD), 2011, no. 4, pp. 398-405 (8 pp.).
This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Multi-core CPUs and GPUs have recently attracted much attention for accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM that makes effective use of the available memory and accelerates the computation. This study shows that the algorithm drastically reduces memory usage. To show the practical use of DEM in industry, a large-scale powder system with a complicated drive unit is simulated. We compared simulation performance on the latest GPU and CPU processors, using programs optimized for each. The results show that the performance difference between GPUs and CPUs is not substantial when a multi-thread parallel algorithm is used. In addition, the DEM algorithm is shown to scale well in multi-thread parallel computation on a CPU.
Keywords: discrete element method; parallel computing; multi-core processor; GPGPU
Energy Efficiency of a Multi-Core Processor by Tag Reduction
9
Authors: 郑龙, 董冕雄, Kaoru Ota, 金海, Song Guo, 马俊. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2011, no. 3, pp. 491-503 (13 pp.).
We consider the problem of saving cache energy on a multi-core processor. Previous research on low-power processors offers various methods to reduce power dissipation, and tag reduction is one of them. This paper extends the tag reduction technique from single-core to multi-core processors and investigates its energy-saving potential on multi-core processors. We formulate our approach as an equivalent problem: find an assignment of all instruction pages in physical memory to a set of cores such that tag-reduction conflicts on each core are avoided or reduced as far as possible. We then propose three algorithms that use different heuristics for this assignment problem. We collect experimental data from a real operating system rather than from a conventional processor simulator, which cannot simulate operating system functions or the full memory hierarchy. Experimental results show that the proposed algorithms save up to 83.93% of total energy on an 8-core processor and 76.16% on a 4-core processor on average, compared with running without tag reduction. They also significantly outperform the tag-reduction-based algorithm for a single-core processor.
Keywords: tag reduction; multi-core processor; energy efficiency
Schedule refinement for homogeneous multi-core processors in the presence of manufacturing-caused heterogeneity
10
Authors: Zhi-xiang CHEN, Zhao-lin LI, Shan CAO, Fang WANG, Jie ZHOU. Frontiers of Information Technology & Electronic Engineering (SCIE, EI, CSCD), 2015, no. 12, pp. 1018-1033 (16 pp.).
Homogeneous multi-core processors have been widely used for computation-intensive embedded applications. However, as CMOS technology continues to scale down, within-die variations in the manufacturing process lead to a significant spread in the operating speeds of cores within homogeneous multi-core processors. Task scheduling approaches that ignore this heterogeneity caused by within-die variations can yield overly pessimistic performance. To achieve optimal performance according to the actual maximum clock frequencies at which cores can run, we present a heterogeneity-aware schedule refining (HASR) scheme that fully exploits the heterogeneity of homogeneous multi-core processors in embedded domains. We analyze and show how the actual maximum frequencies of cores are used to guide scheduling. In the scheme, representative chip operating points are selected and the corresponding optimal schedules are generated as candidate schedules. When each chip boots, one of the candidate schedules is bound to the chip according to the actual maximum clock frequencies of its cores, to maximize performance. A set of applications is designed to evaluate the proposed scheme. Experimental results show that the scheme improves performance by 22.2% on average compared with a baseline schedule based on worst-case timing analysis, and by up to 12% compared with a conventional task scheduling approach based on the actual maximum clock frequencies.
Keywords: schedule refining; multi-core processor; heterogeneity; representative chip operating point
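The boot-time binding step in the HASR abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a hypothetical `Candidate` record that pairs a representative operating point (one maximum frequency per core) with an offline-generated schedule and its estimated makespan, and it treats a candidate as safe only if every core on the actual chip can run at least as fast as that operating point assumes.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate: a representative operating point plus its offline schedule."""
    name: str
    core_freqs_mhz: Tuple[float, ...]  # assumed max frequency of each core at this point
    makespan_us: float                 # estimated schedule length at those frequencies


def bind_schedule(actual_freqs_mhz: Sequence[float],
                  candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the fastest candidate whose per-core frequencies this chip can sustain."""
    feasible = [
        c for c in candidates
        if len(c.core_freqs_mhz) == len(actual_freqs_mhz)
        and all(need <= have for need, have in zip(c.core_freqs_mhz, actual_freqs_mhz))
    ]
    # Among the schedules that are safe for this chip, pick the shortest makespan.
    return min(feasible, key=lambda c: c.makespan_us, default=None)


if __name__ == "__main__":
    candidates = [
        Candidate("worst-case", (800, 800, 800, 800), 1000.0),
        Candidate("fast-core0", (1000, 800, 800, 800), 900.0),
        Candidate("all-fast", (1000, 1000, 1000, 1000), 800.0),
    ]
    measured = [1050, 870, 820, 990]  # hypothetical per-core maxima found at boot
    chosen = bind_schedule(measured, candidates)
    print(chosen.name if chosen else "no safe schedule")  # fast-core0
```

Choosing the shortest feasible makespan mirrors the stated goal of maximizing performance; how the representative operating points are chosen offline is not specified in the abstract and is left out here.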
Thread Private Variable Access Optimization Technique for Sunway High-Performance Multi-core Processors
11
Authors: Jinying Kong, Kai Nie, Qinglei Zhou, Jinlong Xu, Lin Han. 《国际计算机前沿大会会议论文集》 (Proceedings of the International Conference on Computer Frontiers), 2021, no. 1, pp. 180-189 (10 pp.).
The primary way to achieve thread-level parallelism on Sunway high-performance multi-core processors is OpenMP programming. To address the low parallel efficiency caused by slow access to thread-private variables in compiled Sunway OpenMP programs, this paper proposes a thread-private variable access technique based on privileged instructions. The technique centralizes the implementation of thread-private variables at the compiler level, eliminating the mode-switching overhead of invoking the OS kernel and speeding up access to thread-private variables. On the Sunway 1621 server platform, NPB3.3-OMP and SPEC OMP2012 achieved running-efficiency gains of 6.2% and 6.8%, respectively. The results show that the proposed technique can help fully exploit the advantages of Sunway high-performance multi-core processors.
Keywords: Sunway high-performance multi-core processors; OpenMP programming technique; privileged instruction-based thread-private variable access technique; Sunway 1621 processor
Parallel Region Reconstruction Technique for Sunway High-Performance Multi-core Processors
12
Authors: Kai Nie, Qinglei Zhou, Hong Qian, Jianmin Pang, Jinlong Xu, Yapeng Li. 《国际计算机前沿大会会议论文集》 (Proceedings of the International Conference on Computer Frontiers), 2021, no. 1, pp. 163-179 (17 pp.).
The leading way to achieve thread-level parallelism on Sunway high-performance multi-core processors is OpenMP programming. To address the low parallel efficiency caused by high thread-group control overhead in compiled Sunway OpenMP programs, this paper proposes a parallel region reconstruction technique. The technique expands the parallel scope of parallel regions in OpenMP programs through parallel region merging and parallel region extending. It also reduces the number of parallel regions, decreases the overhead of frequently creating and joining thread groups, and converts standard fork-join OpenMP programs into higher-performance SPMD-model OpenMP programs. On the Sunway 1621 server, NPB3.3-OMP and SPEC OMP2012 achieved running-efficiency improvements of 8.9% and 7.9%, respectively, with parallel region reconstruction. The technique is therefore feasible and effective, and it helps fully exploit the multi-core parallelism of Sunway high-performance processors.
Keywords: Sunway high-performance multi-core processors; OpenMP programming technique; parallel domain reconstruction technique
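Design and Implementation of an Embedded Ethernet Interface Based on the DM9000A (cited 26 times)
13
Authors: 施勇, 温阳东. 《合肥工业大学学报(自然科学版)》 (Journal of Hefei University of Technology, Natural Science Edition; CAS, CSCD, PKU Core), 2011, no. 4, pp. 519-524 (6 pp.).
This paper presents a design method for an embedded Ethernet interface based on the 32-bit ARM processor LPC2468 and the Ethernet controller DM9000A. On the hardware side, the work mainly concerns the design of the Ethernet network interface circuit; on the software side, it mainly concerns the driver for the Ethernet controller chip and the upper-layer network protocols. This network access scheme for embedded systems offers a simple hardware interface, few peripheral components, low cost, and a short development cycle.
Keywords: embedded system; LPC2468 processor; DM9000A controller; network driver; TCP/IP network protocol stack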
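Research and Implementation of an iSCSI Initiator on a Multi-Core Network Processor (cited 1 time)
14
Authors: 查奇文, 张武, 曾学文, 宋毅. 《计算机工程》 (Computer Engineering; CAS, CSCD), 2014, no. 5, pp. 304-308 (5 pp.).
Existing Internet Small Computer System Interface (iSCSI) initiator implementations perform and scale poorly on multi-core network processors. To address this, the paper studies network-processing software frameworks for multi-core network processors and proposes one based on a heterogeneous operating system for multi-core network processors. Using this software framework and the P-SPL data-plane programming model, it presents an iSCSI initiator implementation. Experimental results show that, compared with an iSCSI initiator based on the Linux operating system, the implementation based on the heterogeneous operating system clearly improves both throughput and response time. In a test environment with six Gigabit Ethernet ports, read/write throughput reaches up to 180 MB/s, and response time is reduced by up to 1.6 ms.
Keywords: Internet Small Computer System Interface; iSCSI initiator; multi-core network processor; Linux operating system; TCP/IP protocol stack; network processing operating system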
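Design and Implementation of an SOPC-Based Real-Time Ethernet Data Acquisition System (cited 5 times)
15
Authors: 梅大成, 柴志勇. 《计算机应用》 (Journal of Computer Applications; CSCD, PKU Core), 2009, no. B12, pp. 108-109, 112 (3 pp.).
A real-time data acquisition system based on SOPC (system on a programmable chip) technology is designed. The system uses a Nios II soft-core processor as the main controller, the embedded real-time operating system μC/OS-II as its software platform, and LWIP as the Ethernet communication protocol stack, implementing Ethernet transmission and control for the data acquisition system. The complete system was implemented and verified on a Cyclone II EP2C35 development board.
Keywords: Nios II soft-core processor; SOPC; μC/OS-II; LWIP protocol stack; real-time data acquisition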
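A Study of Branch Instruction Characteristics and Branch Predictor Performance (cited 1 time)
16
Authors: 喻明艳, 张祥建, 王晨旭. 《微电子学与计算机》 (Microelectronics & Computer; CSCD, PKU Core), 2010, no. 6, pp. 8-12 (5 pp.).
Based on the characteristics of branch instructions, this paper analyzes how branch behavior and branch prediction techniques affect the composition of CPI stacks in a single-issue embedded processor, and designs a timing-accurate model of a branch predictor at the RTL level. Using hardware simulation, it studies branch instruction characteristics and branch predictor performance. The experiments examine the different jump statistics of branch instructions when the branch predictor hits or misses, verify the theoretical derivation of the predictor's effect on the CPI stack, and provide precise experimental grounds for designing and optimizing branch predictors in single-issue embedded processors.
Keywords: CPI stack; branch predictor; single-issue embedded processor; hardware model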
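How mispredictions add a layer to a CPI stack can be sketched with a textbook bimodal predictor built from 2-bit saturating counters. This is a generic illustration, not the RTL model from the paper; the table size, the 4-byte instruction alignment used for indexing, and the 3-cycle misprediction penalty are all assumptions.

```python
from typing import Iterable, Tuple


class BimodalPredictor:
    """Textbook bimodal predictor: 2-bit saturating counters indexed by branch PC."""

    def __init__(self, entries: int = 256) -> None:
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a positive power of two")
        self.mask = entries - 1
        # Counter values 0-1 predict not-taken, 2-3 predict taken; start weakly not-taken.
        self.counters = [1] * entries

    def _index(self, pc: int) -> int:
        return (pc >> 2) & self.mask  # assumes 4-byte-aligned instructions

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def branch_cpi_component(trace: Iterable[Tuple[int, bool]], total_instructions: int,
                         mispredict_penalty: int = 3) -> float:
    """Stall cycles per instruction caused by mispredictions (one layer of a CPI stack)."""
    predictor = BimodalPredictor()
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses * mispredict_penalty / total_instructions


# A loop branch at PC 0x100 that is taken 9 times and then falls through, run twice.
trace = [(0x100, i < 9) for _ in range(2) for i in range(10)]
print(branch_cpi_component(trace, total_instructions=100))  # 0.09
```

The returned value is the branch component stacked on top of the base CPI: in this trace the predictor misses 3 times (the first taken branch and both loop exits), so 3 mispredictions at 3 cycles each over 100 instructions add 0.09.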
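Design and Implementation of a Protocol Stack Driver Module for a Network-Processor (NP) Firewall (cited 1 time)
17
Authors: 韩志耕, 罗军舟. 《计算机工程》 (Computer Engineering; EI, CAS, CSCD, PKU Core), 2006, no. 21, pp. 136-138 (3 pp.).
Fully opening the path from a network processor's optical port to the local protocol stack requires support from a protocol stack driver. Based on the basic components and internal driving mechanism of the protocol stack driver, and following the layered design principles of the Intel IXA software architecture, this paper proposes and analyzes an implementation scheme on the Linux platform and identifies the key techniques involved. Tests that open the hardware optical port on an Enp2611 evaluation board show that the design meets its intended requirements.
Keywords: protocol stack driver; firewall; network processor; packet classification; proactive security defense system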
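Design and Implementation of Source-Address Routing in NP-Based Policy Routing (cited 2 times)
18
Authors: 易著梁. 《广西民族大学学报(自然科学版)》 (Journal of Guangxi University for Nationalities, Natural Science Edition; CAS), 2013, no. 3, pp. 64-67 (4 pp.).
This paper presents a source-address routing solution based on a network processor. The solution transparently provides high-capacity packet forwarding without reducing the carrying efficiency of IP packets, and is an effective approach.
Keywords: source-address routing; network processor; IP protocol stack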
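Source-address routing can be sketched as a policy lookup that matches a packet's source address before falling back to ordinary destination-based routing. This is a generic software model under stated assumptions, not the paper's network-processor implementation; the table contents, next-hop addresses, and the function names `longest_match` and `select_next_hop` are hypothetical.

```python
import ipaddress
from typing import Dict, Optional, Union

IPAddress = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]
IPNetwork = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# Hypothetical policy table: source prefix -> next hop (example addresses only).
SOURCE_POLICY: Dict[IPNetwork, str] = {
    ipaddress.ip_network("10.1.0.0/16"): "192.0.2.1",
    ipaddress.ip_network("10.1.5.0/24"): "192.0.2.2",
}
# Hypothetical destination-based table consulted when no source policy matches.
MAIN_TABLE: Dict[IPNetwork, str] = {
    ipaddress.ip_network("0.0.0.0/0"): "198.51.100.1",
}


def longest_match(addr: IPAddress, table: Dict[IPNetwork, str]) -> Optional[str]:
    """Return the next hop of the most specific prefix in `table` that contains `addr`."""
    matches = [net for net in table if addr in net]
    if not matches:
        return None
    return table[max(matches, key=lambda net: net.prefixlen)]


def select_next_hop(src: str, dst: str) -> Optional[str]:
    """Try the source-address policy first, then fall back to destination routing."""
    hop = longest_match(ipaddress.ip_address(src), SOURCE_POLICY)
    if hop is not None:
        return hop
    return longest_match(ipaddress.ip_address(dst), MAIN_TABLE)


print(select_next_hop("10.1.5.7", "203.0.113.9"))    # 192.0.2.2 (most specific source prefix)
print(select_next_hop("10.1.9.1", "203.0.113.9"))    # 192.0.2.1
print(select_next_hop("172.16.0.1", "203.0.113.9"))  # 198.51.100.1 (main table fallback)
```

A linear scan keeps the sketch short; a forwarding path at line rate would normally use a trie or TCAM-style lookup instead.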
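Techniques for Accessing DSP Program Memory Space from C (cited 2 times)
19
Authors: 易龙强, 戴瑜兴. 《湖南工程学院学报(自然科学版)》 (Journal of Hunan Institute of Engineering, Natural Science Edition), 2006, no. 4, pp. 1-3, 19 (4 pp.).
The C compiler for TMS320C2xx-series DSPs provides no C runtime library functions for operating on data in program memory, and this paper describes a solution. By presenting the assembly instructions used to implement the functions, the software stack structure of TI's C compilation environment, and the C calling conventions, it details how to implement C-callable routines that access DSP program memory space. The technique suits engineering applications with large amounts of constant data, where it relieves shortages of data memory. It can also reserve a region of program memory as non-volatile storage for user data that must survive power loss, which simplifies the system and improves its performance. Practical use shows that the technique is highly valuable.
Keywords: DSP; C compiler; stack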
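Design and Implementation of a Novel Network-Processor-Based IPv6 Forwarding System
20
Authors: 苏金树, 时向泉, 吴纯青. 《国防科技大学学报》 (Journal of National University of Defense Technology; EI, CAS, CSCD, PKU Core), 2005, no. 5, pp. 6-11 (6 pp.).
The separation of forwarding and control, together with the development of network processors, strongly affects router scalability, flexibility, and performance, and IPv6, as the core of the next-generation Internet protocols, is an important subject of router research. This paper briefly describes the system architecture of an IPv6 router based on the ForCES forwarding/control separation architecture. It focuses on the forwarding architecture of a network-processor-based IPv6 router, the flow design of the dual-stack forwarding system, and the implementation of the tunneling mechanism, and it presents measured results from a prototype IPv6 router.
Keywords: IPv6; forwarding and control separation; network processor; dual stack; tunneling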
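The abstract does not detail the tunneling mechanism. As a generic illustration only, a configured IPv6-in-IPv4 tunnel (as described in RFC 4213) wraps each IPv6 packet in an IPv4 header whose protocol field is 41. The sketch below builds such a header in software; the function names and defaults (TTL 64, identification and fragmentation fields left at zero) are assumptions, and it says nothing about how the paper's network-processor data path performs encapsulation.

```python
import ipaddress
import struct

IPPROTO_IPV6 = 41  # IPv4 protocol number for an encapsulated IPv6 packet


def ipv4_checksum(header: bytes) -> int:
    """RFC 791 header checksum: one's complement of the one's-complement sum of 16-bit words."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def encapsulate_6in4(ipv6_packet: bytes, src_v4: str, dst_v4: str, ttl: int = 64) -> bytes:
    """Prepend a 20-byte IPv4 header (no options) to an IPv6 packet for a configured tunnel."""
    if not ipv6_packet or ipv6_packet[0] >> 4 != 6:
        raise ValueError("payload is not an IPv6 packet")
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(ipv6_packet),  # version 4 + IHL 5, TOS, total length
        0, 0,                            # identification, flags/fragment offset
        ttl, IPPROTO_IPV6, 0,            # TTL, protocol 41, checksum placeholder
        ipaddress.IPv4Address(src_v4).packed,
        ipaddress.IPv4Address(dst_v4).packed,
    )
    checksum = ipv4_checksum(header)
    return header[:10] + struct.pack("!H", checksum) + header[12:] + ipv6_packet


# Dummy 40-byte IPv6 header with no payload, tunnelled between two example endpoints.
inner = bytes([0x60]) + bytes(39)
outer = encapsulate_6in4(inner, "192.0.2.1", "198.51.100.7")
print(len(outer), outer[9], ipv4_checksum(outer[:20]))  # 60 41 0
```

Recomputing the checksum over a finished header yields 0, which is the usual receiver-side validity check; decapsulation at the far tunnel endpoint simply strips the 20-byte header.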