针对多核处理器性能优化问题,文中深入研究多核处理器上共享Cache的管理策略,提出了基于缓存时间公平性与吞吐率的共享Cache划分算法MT-FTP(Memory Time based Fair and Throughput Partitioning)。以公平性和吞吐率两个评价性指标建立...针对多核处理器性能优化问题,文中深入研究多核处理器上共享Cache的管理策略,提出了基于缓存时间公平性与吞吐率的共享Cache划分算法MT-FTP(Memory Time based Fair and Throughput Partitioning)。以公平性和吞吐率两个评价性指标建立数学模型,并分析了算法的划分流程。仿真实验结果表明,MT-FTP算法在系统吞吐率方面表现较好,其平均IPC(Instructions Per Cycles)值比UCP(Use Case Point)算法高1.3%,比LRU(Least Recently Used)算法高11.6%。MT-FTP算法对应的系统平均公平性比LRU算法的系统平均公平性高17%,比UCP算法的平均公平性高16.5%。该算法实现了共享Cache划分公平性并兼顾了系统的吞吐率。展开更多
A flexible field programmable gate array based radar signal processor is presented. The radar signal processor mainly consists of five functional modules: radar system timer, binary phase coded pulse compression(PC...A flexible field programmable gate array based radar signal processor is presented. The radar signal processor mainly consists of five functional modules: radar system timer, binary phase coded pulse compression(PC), moving target detection (MTD), constant false alarm rate (CFAR) and target dots processing. Preliminary target dots information is obtained in PC, MTD, and CFAR modules and Nios I! CPU is used for target dots combination and false sidelobe target removing. Sys- tem on programmable chip (SOPC) technique is adopted in the system in which SDRAM is used to cache data. Finally, a FPGA-based binary phase coded radar signal processor is realized and simula- tion result is given.展开更多
There are varieties of embedded systems in the world. It is a big challenge to optimize the instruction sets of System on Chips (SoCs) according to different systems' working environments. The idea of programmable...There are varieties of embedded systems in the world. It is a big challenge to optimize the instruction sets of System on Chips (SoCs) according to different systems' working environments. The idea of programmable instruction set is an effective method to gain embedded system's re-configurability. This letter presents a logic module for Java processor to be capable of using programmable instruction set. Cost (area, power, and timing) of the module is trivial. Such module is also reusable for other embedded system solutions besides Java systems.展开更多
The paper proposes a low power non-volatile baseband processor with wake-up identification(WUI) receiver for LR-WPAN transceiver.It consists of WUI receiver,main receiver,transmitter,non-volatile memory(NVM) and power...The paper proposes a low power non-volatile baseband processor with wake-up identification(WUI) receiver for LR-WPAN transceiver.It consists of WUI receiver,main receiver,transmitter,non-volatile memory(NVM) and power management module.The main receiver adopts a unified simplified synchronization method and channel codec with proactive Reed-Solomon Bypass technique,which increases the robustness and energy efficiency of receiver.The WUI receiver specifies the communication node and wakes up the transceiver to reduce average power consumption of the transceiver.The embedded NVM can backup/restore the states information of processor that avoids the loss of the state information caused by power failure and reduces the unnecessary power of repetitive computation when the processor is waked up from power down mode.The baseband processor is designed and verified on a FPGA board.The simulated power consumption of processor is 5.1uW for transmitting and 28.2μW for receiving.The WUI receiver technique reduces the average power consumption of transceiver remarkably.If the transceiver operates 30 seconds in every 15 minutes,the average power consumption of the transceiver can be reduced by two orders of magnitude.The NVM avoids the loss of the state information caused by power failure and energy waste caused by repetitive computation.展开更多
类脑处理器较深度学习处理器具有能效优势.类脑处理器的片上互连一般采用具有可扩展性高、吞吐量高和通用性高等特点的片上网络.为了解决采用同步片上网络面临的全局时钟树时序难以收敛的问题以及采用异步片上网络面临的链路延迟匹配、...类脑处理器较深度学习处理器具有能效优势.类脑处理器的片上互连一般采用具有可扩展性高、吞吐量高和通用性高等特点的片上网络.为了解决采用同步片上网络面临的全局时钟树时序难以收敛的问题以及采用异步片上网络面临的链路延迟匹配、缺乏电子设计自动化工具实现和验证的问题,提出了一种异步片上网络架构——NosralC,用于构建全局异步局部同步(global asynchronous local synchronous,GALS)的多核类脑处理器.NosralC采用异步链路和同步路由器实现.实验表明,NosralC较同步基线,在4个类脑应用数据集下展现出37.5%~38.9%的功耗降低、5.5%~8.0%的平均延迟降低和36.7%~47.6%的能效提升,同时增加不多于6%的额外资源以及带来较小的性能开销(吞吐量降低0.8%~2.4%).NosralC在现场可编程门阵列(FPGA)上得到了验证,证明了该架构的可实现性.展开更多
基于异构众核处理器的超级计算机已经成为TOP500高性能计算机的主流,BSP、LogP、PRAM等已有并行计算模型均针对基于多核处理器的超级计算机设计,不能满足日益迫切的基于众核架构的超级计算机和应用发展需求.本文面向“神威·太湖之...基于异构众核处理器的超级计算机已经成为TOP500高性能计算机的主流,BSP、LogP、PRAM等已有并行计算模型均针对基于多核处理器的超级计算机设计,不能满足日益迫切的基于众核架构的超级计算机和应用发展需求.本文面向“神威·太湖之光”和神威E级原型系统的众核体系结构特点,提出P-PALN(Parallel-Parallel Access via LDM&NOC)并行计算模型,对于计算节点间的并行,该模型沿用BSP/LogP模型描述;对于计算节点内的众核并行,该模型提供私有存储访问和片上阵列通信的众核并行架构的有效描述PALN,能够协助用户进行众核并行算法设计,并在申威众核处理器硬件设计中指导参数的优化.实验结果表明,该模型可有效指导硬件设计和用户众核编程,从而提高系统和应用的性能.展开更多
随着多核处理器片上集成核数的不断增多,并行任务的调度能力越来越成为制约性能提升的关键因素。文章设计一种面向异构多核计算系统的动态任务调度控制器,主要实现动态监控处理单元的负载情况、动态任务唤醒、乱序任务发射、任务写回安...随着多核处理器片上集成核数的不断增多,并行任务的调度能力越来越成为制约性能提升的关键因素。文章设计一种面向异构多核计算系统的动态任务调度控制器,主要实现动态监控处理单元的负载情况、动态任务唤醒、乱序任务发射、任务写回安全管理等功能;研究一种降低计算任务结果数据回写双倍数据速率(double data rate,DDR)外存储器次数的方法,大幅节省了访存开销,进一步提升了计算性能。仿真及性能测试显示,在典型应用场景下,与已有的无动态调度功能的任务发射控制器相比,实现了显示并行化编程向任务并行的自动化控制过渡,编程友好度显著提高,在不同类型的测试案例中,分别提升了11.3%~37.9%的计算性能。展开更多
文摘针对多核处理器性能优化问题,文中深入研究多核处理器上共享Cache的管理策略,提出了基于缓存时间公平性与吞吐率的共享Cache划分算法MT-FTP(Memory Time based Fair and Throughput Partitioning)。以公平性和吞吐率两个评价性指标建立数学模型,并分析了算法的划分流程。仿真实验结果表明,MT-FTP算法在系统吞吐率方面表现较好,其平均IPC(Instructions Per Cycles)值比UCP(Use Case Point)算法高1.3%,比LRU(Least Recently Used)算法高11.6%。MT-FTP算法对应的系统平均公平性比LRU算法的系统平均公平性高17%,比UCP算法的平均公平性高16.5%。该算法实现了共享Cache划分公平性并兼顾了系统的吞吐率。
基金Supported by the Ministerial Level Advanced Research Foundation (SP240012)
文摘A flexible field programmable gate array based radar signal processor is presented. The radar signal processor mainly consists of five functional modules: radar system timer, binary phase coded pulse compression(PC), moving target detection (MTD), constant false alarm rate (CFAR) and target dots processing. Preliminary target dots information is obtained in PC, MTD, and CFAR modules and Nios I! CPU is used for target dots combination and false sidelobe target removing. Sys- tem on programmable chip (SOPC) technique is adopted in the system in which SDRAM is used to cache data. Finally, a FPGA-based binary phase coded radar signal processor is realized and simula- tion result is given.
基金Supported by the Guangzhou Key Technology R&D Program (No.2007Z2-D0011)
文摘There are varieties of embedded systems in the world. It is a big challenge to optimize the instruction sets of System on Chips (SoCs) according to different systems' working environments. The idea of programmable instruction set is an effective method to gain embedded system's re-configurability. This letter presents a logic module for Java processor to be capable of using programmable instruction set. Cost (area, power, and timing) of the module is trivial. Such module is also reusable for other embedded system solutions besides Java systems.
基金supported in part by the National Natural Science Foundation of China(No.61306027)
文摘The paper proposes a low power non-volatile baseband processor with wake-up identification(WUI) receiver for LR-WPAN transceiver.It consists of WUI receiver,main receiver,transmitter,non-volatile memory(NVM) and power management module.The main receiver adopts a unified simplified synchronization method and channel codec with proactive Reed-Solomon Bypass technique,which increases the robustness and energy efficiency of receiver.The WUI receiver specifies the communication node and wakes up the transceiver to reduce average power consumption of the transceiver.The embedded NVM can backup/restore the states information of processor that avoids the loss of the state information caused by power failure and reduces the unnecessary power of repetitive computation when the processor is waked up from power down mode.The baseband processor is designed and verified on a FPGA board.The simulated power consumption of processor is 5.1uW for transmitting and 28.2μW for receiving.The WUI receiver technique reduces the average power consumption of transceiver remarkably.If the transceiver operates 30 seconds in every 15 minutes,the average power consumption of the transceiver can be reduced by two orders of magnitude.The NVM avoids the loss of the state information caused by power failure and energy waste caused by repetitive computation.
文摘类脑处理器较深度学习处理器具有能效优势.类脑处理器的片上互连一般采用具有可扩展性高、吞吐量高和通用性高等特点的片上网络.为了解决采用同步片上网络面临的全局时钟树时序难以收敛的问题以及采用异步片上网络面临的链路延迟匹配、缺乏电子设计自动化工具实现和验证的问题,提出了一种异步片上网络架构——NosralC,用于构建全局异步局部同步(global asynchronous local synchronous,GALS)的多核类脑处理器.NosralC采用异步链路和同步路由器实现.实验表明,NosralC较同步基线,在4个类脑应用数据集下展现出37.5%~38.9%的功耗降低、5.5%~8.0%的平均延迟降低和36.7%~47.6%的能效提升,同时增加不多于6%的额外资源以及带来较小的性能开销(吞吐量降低0.8%~2.4%).NosralC在现场可编程门阵列(FPGA)上得到了验证,证明了该架构的可实现性.
文摘基于异构众核处理器的超级计算机已经成为TOP500高性能计算机的主流,BSP、LogP、PRAM等已有并行计算模型均针对基于多核处理器的超级计算机设计,不能满足日益迫切的基于众核架构的超级计算机和应用发展需求.本文面向“神威·太湖之光”和神威E级原型系统的众核体系结构特点,提出P-PALN(Parallel-Parallel Access via LDM&NOC)并行计算模型,对于计算节点间的并行,该模型沿用BSP/LogP模型描述;对于计算节点内的众核并行,该模型提供私有存储访问和片上阵列通信的众核并行架构的有效描述PALN,能够协助用户进行众核并行算法设计,并在申威众核处理器硬件设计中指导参数的优化.实验结果表明,该模型可有效指导硬件设计和用户众核编程,从而提高系统和应用的性能.
文摘随着多核处理器片上集成核数的不断增多,并行任务的调度能力越来越成为制约性能提升的关键因素。文章设计一种面向异构多核计算系统的动态任务调度控制器,主要实现动态监控处理单元的负载情况、动态任务唤醒、乱序任务发射、任务写回安全管理等功能;研究一种降低计算任务结果数据回写双倍数据速率(double data rate,DDR)外存储器次数的方法,大幅节省了访存开销,进一步提升了计算性能。仿真及性能测试显示,在典型应用场景下,与已有的无动态调度功能的任务发射控制器相比,实现了显示并行化编程向任务并行的自动化控制过渡,编程友好度显著提高,在不同类型的测试案例中,分别提升了11.3%~37.9%的计算性能。