多核结构上采用由用户显式制导的并行程序设计模型,使用锁和同步变量来实现同步。事务存储模型能够解决由锁机制带来的一系列问题,提高程序的并发性。介绍了在文中提出的一种基于事务存储模型的多核结构(Transactional-Memory based Chi...多核结构上采用由用户显式制导的并行程序设计模型,使用锁和同步变量来实现同步。事务存储模型能够解决由锁机制带来的一系列问题,提高程序的并发性。介绍了在文中提出的一种基于事务存储模型的多核结构(Transactional-Memory based Chip Multiple-Superscaler,TMCMS)上的并行编程模型,以及针对循环程序的执行模型;以FFT程序为例具体介绍了循环结构的并行化方法和编译转换过程。在初步的实验中,将处理单元从1增加到16个时,在所设计的编程模型的支持下,IPC(Instruction PerCycle)有接近线性的增长,说明该并行编程模型能够充分发掘程序中潜在的细粒度线程级并行性,同时保持并行程序设计的简单性。展开更多
The high-speed computational performance is gained at the cost of huge hardware resource,which restricts the application of high-accuracy algorithms because of the limited hardware cost in practical use.To solve the p...The high-speed computational performance is gained at the cost of huge hardware resource,which restricts the application of high-accuracy algorithms because of the limited hardware cost in practical use.To solve the problem,a novel method for designing the field programmable gate array(FPGA)-based non-uniform rational B-spline(NURBS) interpolator and motion controller,which adopts the embedded multiprocessor technique,is proposed in this study.The hardware and software design for the multiprocessor,one of which is for NURBS interpolation and the other for position servo control,is presented.Performance analysis and experiments on an X-Y table are carried out,hardware cost as well as consuming time for interpolation and motion control is compared with the existing methods.The experimental and comparing results indicate that,compared with the existing methods,the proposed method can reduce the hardware cost by 97.5% using higher-accuracy interpolation algorithm within the period of 0.5 ms.A method which ensures the real-time performance and interpolation accuracy,and reduces the hardware cost significantly is proposed,and it’s practical in the use of industrial application.展开更多
Helper-thread of a task can hide the memory access time of irregular data on the chip muhi-core processor (CMP). For constructing a compiler that effectively supports the helper-thread of a task in the multi-core sc...Helper-thread of a task can hide the memory access time of irregular data on the chip muhi-core processor (CMP). For constructing a compiler that effectively supports the helper-thread of a task in the multi-core scenario based on the last level shared cache, this paper studies its performance stable condi- tions. Unfortunately, there is no existing model that allows extensive investigation of the impact of stable conditions, we present the base of pre-computation that is formalized by our degraded task-pair 〈 T, T' 〉 with the helper-thread, and its stable conditions are analyzed. Finally, a novel performance model and a constructing method of pre-computation based on our positive degraded task-pair are proposed. The efficient results are shown by our experiments. If we further exploit memory level parallelism (MLP) for our task-pair, the task-pair 〈 T, T' 〉 can reach better performance.展开更多
Single-chip multiprocessor (CMP) combined with the fault-loleranl(FT)techniques offers an ideal architecture to achieve high availability on the basis of sustaining highcomputing performance FT design of a single-chip...Single-chip multiprocessor (CMP) combined with the fault-loleranl(FT)techniques offers an ideal architecture to achieve high availability on the basis of sustaining highcomputing performance FT design of a single-chip multiprocessor is described, including thetechniques from hard-wart redundancy to software support and firmware strategy. The design aims atmasking the influences of errors and automatically correcting the system states.展开更多
Significant advances in field-programmable gate arrays (FPGAs) have made it viable to explore innovative multiprocessor solutions on a single FPGA chip. For multiprocessors, an efficient communication network that m...Significant advances in field-programmable gate arrays (FPGAs) have made it viable to explore innovative multiprocessor solutions on a single FPGA chip. For multiprocessors, an efficient communication network that matches the needs of the target application is always critical to the overall performance. Wormhole packet-switching network-on-chip (NoC) solutions are replacing conventional shared buses to deal with scalability and complexity challenges coming along with the increasing number of processing elements (PEs). However, the quest for high performance networks has led to very complex and resource-expensive NoC designs, leaving little room for the real computing force, i.e., PEs. Moreover, many techniques offer very small performance gains or none at all when network traffic is light while increasing the resource usage of routers. We argue that computation is still the primary task of multiprocessors and sufficient resources should be reserved for PEs. This paper presents our novel design and implementation of a resource-efficient communication network for multiprocessors on FPGAs. We reduce not only the required number of routers for a given number of PEs by introducing a new PE-router topology, but also the resource requirement of each router. Our communication network relies on the NEWS channels to transfer packets in a pipelined fashion following the path determined by the routing network, The implementation results on various Xilinx FPGAs show good performance in the typical range of network load for multiprocessor applications.展开更多
As the number of cores in chip multiprocessors (CMPs) increases, cache coherence protocol has become a key issue in integration of chip multiprocessors. Supporting cache coherence protocol in large chip multiprocess...As the number of cores in chip multiprocessors (CMPs) increases, cache coherence protocol has become a key issue in integration of chip multiprocessors. Supporting cache coherence protocol in large chip multiprocessors still faces three hurdles: design complexity, performance and scalability. This paper proposes Cache Coherent Network on Chip (CCNoC), a scheme that decouples cache coherency maintenance from processors and shared L2 caches and implements it completely in network on chip to free up processors and shared L2 caches from the chore of maintaining coherency, thereby reduces design complexity of CMPs. In this way, CCNoC also improves the performance of cache coherence protocol through reducing directory access latency and enhances scalability by avoiding massive directories overhead in shared L2 caches. In CCNoC, coherence state caches and active directory caches are implemented in the network interface components of network on chip to maintain cache coherence states for blocks in L1 caches and manage directory information for recently accessed blocks in L2 caches respectively. CCNoC provides a scalable CMP framework to tackle cache coherency which is the foundation of CMP. This paper evaluates the performance of CCNoC. Experimental results show that for a 16-core system, CCNoC improves performance by 3% on average over the conventional chip multiprocessor and by 10% at best, while reduces storage overhead by 1.8% and saves directory storage by 88%, showing good scalability.展开更多
Increasing the life span and efficiency of Multiprocessor System on Chip(MPSoC)by reducing power and energy utilization has become a critical chip design challenge for multiprocessor systems.With the advancement of te...Increasing the life span and efficiency of Multiprocessor System on Chip(MPSoC)by reducing power and energy utilization has become a critical chip design challenge for multiprocessor systems.With the advancement of technology,the performance management of central processing unit(CPU)is changing.Power densities and thermal effects are quickly increasing in multi-core embedded technologies due to shrinking of chip size.When energy consumption reaches a threshold that creates a delay in complementary metal oxide semiconductor(CMOS)circuits and reduces the speed by 10%–15%because excessive on-chip temperature shortens the chip’s life cycle.In this paper,we address the scheduling&energy utilization problem by introducing and evaluating an optimal energy-aware earliest deadline first scheduling(EA-EDF)based technique formultiprocessor environments with task migration that enhances the performance and efficiency in multiprocessor systemon-chip while lowering energy and power consumption.The selection of core andmigration of tasks prevents the system from reaching itsmaximumenergy utilization while effectively using the dynamic power management(DPM)policy.Increase in the execution of tasks the temperature and utilization factor(u_(i))on-chip increases that dissipate more power.The proposed approach migrates such tasks to the core that produces less heat and consumes less power by distributing the load on other cores to lower the temperature and optimizes the duration of idle and sleep times across multiple CPUs.The performance of the EA-EDF algorithm was evaluated by an extensive set of experiments,where excellent results were reported when compared to other current techniques,the efficacy of the proposed methodology reduces the power and energy consumption by 4.3%–4.7%on a utilization of 6%,36%&46%at 520&624 MHz operating frequency when particularly in comparison to other energy-aware methods for MPSoCs.Tasks are running and accurately scheduled to make an energy-efficient processor by controlling and managing the thermal effects on-chip and optimizing the energy consumption of MPSoCs.展开更多
Minimizing the energy consumption to increase the life span and performance of multiprocessor system on chip(MPSoC)has become an integral chip design issue for multiprocessor systems.The performance measurement of com...Minimizing the energy consumption to increase the life span and performance of multiprocessor system on chip(MPSoC)has become an integral chip design issue for multiprocessor systems.The performance measurement of computational systems is changing with the advancement in technology.Due to shrinking and smaller chip size power densities onchip are increasing rapidly that increasing chip temperature in multi-core embedded technologies.The operating speed of the device decreases when power consumption reaches a threshold that causes a delay in complementary metal oxide semiconductor(CMOS)circuits because high on-chip temperature adversely affects the life span of the chip.In this paper an energy-aware dynamic power management technique based on energy aware earliest deadline first(EA-EDF)scheduling is proposed for improving the performance and reliability by reducing energy and power consumption in the system on chip(SOC).Dynamic power management(DPM)enables MPSOC to reduce power and energy consumption by adopting a suitable core configuration for task migration.Task migration avoids peak temperature values in the multicore system.High utilization factor(ui)on central processing unit(CPU)core consumes more energy and increases the temperature on-chip.Our technique switches the core bymigrating such task to a core that has less temperature and is in a low power state.The proposed EA-EDF scheduling technique migrates load on different cores to attain stability in temperature among multiple cores of the CPU and optimized the duration of the idle and sleep periods to enable the low-temperature core.The effectiveness of the EA-EDF approach reduces the utilization and energy consumption compared to other existing methods and works.The simulation results show the improvement in performance by optimizing 4.8%on u_(i) 9%,16%,23%and 25%at 520 MHz operating frequency as compared to other energy-aware techniques for MPSoCs when the least number of tasks is in running state and can schedule more tasks to make an energy-efficient processor by controlling and managing the energy consumption of MPSoC.展开更多
文摘多核结构上采用由用户显式制导的并行程序设计模型,使用锁和同步变量来实现同步。事务存储模型能够解决由锁机制带来的一系列问题,提高程序的并发性。介绍了在文中提出的一种基于事务存储模型的多核结构(Transactional-Memory based Chip Multiple-Superscaler,TMCMS)上的并行编程模型,以及针对循环程序的执行模型;以FFT程序为例具体介绍了循环结构的并行化方法和编译转换过程。在初步的实验中,将处理单元从1增加到16个时,在所设计的编程模型的支持下,IPC(Instruction PerCycle)有接近线性的增长,说明该并行编程模型能够充分发掘程序中潜在的细粒度线程级并行性,同时保持并行程序设计的简单性。
基金supported by National Key Basic Research Program of China(973 ProgramGrant No.2011CB706804)+1 种基金Shanghai Municipal Science and Technology Commission of China(Grant No.11QH1401400)Research Project of State Key Laboratory of Mechanical System & Vibration of China(Grant No.MSVMS201102)
文摘The high-speed computational performance is gained at the cost of huge hardware resource,which restricts the application of high-accuracy algorithms because of the limited hardware cost in practical use.To solve the problem,a novel method for designing the field programmable gate array(FPGA)-based non-uniform rational B-spline(NURBS) interpolator and motion controller,which adopts the embedded multiprocessor technique,is proposed in this study.The hardware and software design for the multiprocessor,one of which is for NURBS interpolation and the other for position servo control,is presented.Performance analysis and experiments on an X-Y table are carried out,hardware cost as well as consuming time for interpolation and motion control is compared with the existing methods.The experimental and comparing results indicate that,compared with the existing methods,the proposed method can reduce the hardware cost by 97.5% using higher-accuracy interpolation algorithm within the period of 0.5 ms.A method which ensures the real-time performance and interpolation accuracy,and reduces the hardware cost significantly is proposed,and it’s practical in the use of industrial application.
文摘Helper-thread of a task can hide the memory access time of irregular data on the chip muhi-core processor (CMP). For constructing a compiler that effectively supports the helper-thread of a task in the multi-core scenario based on the last level shared cache, this paper studies its performance stable condi- tions. Unfortunately, there is no existing model that allows extensive investigation of the impact of stable conditions, we present the base of pre-computation that is formalized by our degraded task-pair 〈 T, T' 〉 with the helper-thread, and its stable conditions are analyzed. Finally, a novel performance model and a constructing method of pre-computation based on our positive degraded task-pair are proposed. The efficient results are shown by our experiments. If we further exploit memory level parallelism (MLP) for our task-pair, the task-pair 〈 T, T' 〉 can reach better performance.
基金Supported by the National High Techology Devel opment 863 Program of China(2002AA1Z030) and China PostdoctoralScience Foundation(2003034151)
文摘Single-chip multiprocessor (CMP) combined with the fault-loleranl(FT)techniques offers an ideal architecture to achieve high availability on the basis of sustaining highcomputing performance FT design of a single-chip multiprocessor is described, including thetechniques from hard-wart redundancy to software support and firmware strategy. The design aims atmasking the influences of errors and automatically correcting the system states.
文摘Significant advances in field-programmable gate arrays (FPGAs) have made it viable to explore innovative multiprocessor solutions on a single FPGA chip. For multiprocessors, an efficient communication network that matches the needs of the target application is always critical to the overall performance. Wormhole packet-switching network-on-chip (NoC) solutions are replacing conventional shared buses to deal with scalability and complexity challenges coming along with the increasing number of processing elements (PEs). However, the quest for high performance networks has led to very complex and resource-expensive NoC designs, leaving little room for the real computing force, i.e., PEs. Moreover, many techniques offer very small performance gains or none at all when network traffic is light while increasing the resource usage of routers. We argue that computation is still the primary task of multiprocessors and sufficient resources should be reserved for PEs. This paper presents our novel design and implementation of a resource-efficient communication network for multiprocessors on FPGAs. We reduce not only the required number of routers for a given number of PEs by introducing a new PE-router topology, but also the resource requirement of each router. Our communication network relies on the NEWS channels to transfer packets in a pipelined fashion following the path determined by the routing network, The implementation results on various Xilinx FPGAs show good performance in the typical range of network load for multiprocessor applications.
基金supported by the National Natural Science Foundation of China under Grant Nos.60970002,60833004,60773146, and 60673145.
文摘As the number of cores in chip multiprocessors (CMPs) increases, cache coherence protocol has become a key issue in integration of chip multiprocessors. Supporting cache coherence protocol in large chip multiprocessors still faces three hurdles: design complexity, performance and scalability. This paper proposes Cache Coherent Network on Chip (CCNoC), a scheme that decouples cache coherency maintenance from processors and shared L2 caches and implements it completely in network on chip to free up processors and shared L2 caches from the chore of maintaining coherency, thereby reduces design complexity of CMPs. In this way, CCNoC also improves the performance of cache coherence protocol through reducing directory access latency and enhances scalability by avoiding massive directories overhead in shared L2 caches. In CCNoC, coherence state caches and active directory caches are implemented in the network interface components of network on chip to maintain cache coherence states for blocks in L1 caches and manage directory information for recently accessed blocks in L2 caches respectively. CCNoC provides a scalable CMP framework to tackle cache coherency which is the foundation of CMP. This paper evaluates the performance of CCNoC. Experimental results show that for a 16-core system, CCNoC improves performance by 3% on average over the conventional chip multiprocessor and by 10% at best, while reduces storage overhead by 1.8% and saves directory storage by 88%, showing good scalability.
文摘Increasing the life span and efficiency of Multiprocessor System on Chip(MPSoC)by reducing power and energy utilization has become a critical chip design challenge for multiprocessor systems.With the advancement of technology,the performance management of central processing unit(CPU)is changing.Power densities and thermal effects are quickly increasing in multi-core embedded technologies due to shrinking of chip size.When energy consumption reaches a threshold that creates a delay in complementary metal oxide semiconductor(CMOS)circuits and reduces the speed by 10%–15%because excessive on-chip temperature shortens the chip’s life cycle.In this paper,we address the scheduling&energy utilization problem by introducing and evaluating an optimal energy-aware earliest deadline first scheduling(EA-EDF)based technique formultiprocessor environments with task migration that enhances the performance and efficiency in multiprocessor systemon-chip while lowering energy and power consumption.The selection of core andmigration of tasks prevents the system from reaching itsmaximumenergy utilization while effectively using the dynamic power management(DPM)policy.Increase in the execution of tasks the temperature and utilization factor(u_(i))on-chip increases that dissipate more power.The proposed approach migrates such tasks to the core that produces less heat and consumes less power by distributing the load on other cores to lower the temperature and optimizes the duration of idle and sleep times across multiple CPUs.The performance of the EA-EDF algorithm was evaluated by an extensive set of experiments,where excellent results were reported when compared to other current techniques,the efficacy of the proposed methodology reduces the power and energy consumption by 4.3%–4.7%on a utilization of 6%,36%&46%at 520&624 MHz operating frequency when particularly in comparison to other energy-aware methods for MPSoCs.Tasks are running and accurately scheduled to make an energy-efficient processor by controlling and managing the thermal effects on-chip and optimizing the energy consumption of MPSoCs.
文摘Minimizing the energy consumption to increase the life span and performance of multiprocessor system on chip(MPSoC)has become an integral chip design issue for multiprocessor systems.The performance measurement of computational systems is changing with the advancement in technology.Due to shrinking and smaller chip size power densities onchip are increasing rapidly that increasing chip temperature in multi-core embedded technologies.The operating speed of the device decreases when power consumption reaches a threshold that causes a delay in complementary metal oxide semiconductor(CMOS)circuits because high on-chip temperature adversely affects the life span of the chip.In this paper an energy-aware dynamic power management technique based on energy aware earliest deadline first(EA-EDF)scheduling is proposed for improving the performance and reliability by reducing energy and power consumption in the system on chip(SOC).Dynamic power management(DPM)enables MPSOC to reduce power and energy consumption by adopting a suitable core configuration for task migration.Task migration avoids peak temperature values in the multicore system.High utilization factor(ui)on central processing unit(CPU)core consumes more energy and increases the temperature on-chip.Our technique switches the core bymigrating such task to a core that has less temperature and is in a low power state.The proposed EA-EDF scheduling technique migrates load on different cores to attain stability in temperature among multiple cores of the CPU and optimized the duration of the idle and sleep periods to enable the low-temperature core.The effectiveness of the EA-EDF approach reduces the utilization and energy consumption compared to other existing methods and works.The simulation results show the improvement in performance by optimizing 4.8%on u_(i) 9%,16%,23%and 25%at 520 MHz operating frequency as compared to other energy-aware techniques for MPSoCs when the least number of tasks is in running state and can schedule more tasks to make an energy-efficient processor by controlling and managing the energy consumption of MPSoC.