Funding: Supported in part by the National Science Foundation of USA under Grant Nos. EIA-0224377, CNS-0406328, CNS-0509118, and CCF-0621435.
Abstract: Data prefetching is an effective latency-hiding technique that masks the CPU stalls caused by cache misses and bridges the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to a processor before it is actually needed. Many prefetching techniques have been developed for single-core processors. Recent developments in processor technology have brought multicore processors into the mainstream. While some single-core prefetching techniques are directly applicable to multicore processors, numerous novel strategies have been proposed in the past few years to take advantage of multiple cores. This paper provides a comprehensive review of state-of-the-art prefetching techniques and proposes a taxonomy that classifies the design concerns in developing a prefetching strategy, especially for multicore processors. We also compare existing methods through analysis.
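As a rough illustration of what "bringing data closer before it is needed" can mean, here is a minimal sketch of a per-instruction stride prefetcher, one well-known family of prefetching schemes. It is not taken from the surveyed paper: the class names `StrideEntry` and `StridePrefetcher`, the `degree` parameter, and the two-in-a-row confidence rule are all assumptions chosen for brevity.

```python
from dataclasses import dataclass


@dataclass
class StrideEntry:
    last_addr: int           # most recent address issued by this load instruction
    stride: int = 0          # last observed address delta
    confident: bool = False  # True once the same non-zero stride repeats


class StridePrefetcher:
    """Toy per-PC stride prefetcher (hypothetical model, not a hardware design)."""

    def __init__(self, degree=1):
        self.degree = degree  # how many strides ahead to prefetch (assumed knob)
        self.table = {}       # maps load PC -> StrideEntry

    def access(self, pc, addr):
        """Record one load and return the list of addresses to prefetch."""
        entry = self.table.get(pc)
        if entry is None:
            # First time this load is seen: nothing to predict yet.
            self.table[pc] = StrideEntry(last_addr=addr)
            return []
        stride = addr - entry.last_addr
        # Only trust a stride that matches the previous one and is non-zero.
        entry.confident = stride != 0 and stride == entry.stride
        entry.stride = stride
        entry.last_addr = addr
        if not entry.confident:
            return []
        return [addr + stride * i for i in range(1, self.degree + 1)]


if __name__ == "__main__":
    pf = StridePrefetcher(degree=2)
    for a in (100, 164, 228, 292):
        print(a, pf.access(pc=0x400, addr=a))
```

A real prefetcher also has to decide where to place the fetched data and how to limit useless prefetches; the sketch covers only the "what to prefetch" question.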
Funding: Supported in part by NSF Awards of USA under Grant Nos. CNS-0855247 and CNS-1016974, and by NSF CAREER Award of USA under Grant No. CNS-0953005.
Abstract: Due to the increasing power consumption of modern computing systems, energy management has become an important research area in the last decade. Recently, multicore has emerged as an energy-efficient architecture that exploits the parallelism in modern applications. However, as the number of cores on a single chip continues to increase, effectively managing the energy efficiency of multicore-based systems has become a grand challenge. In this paper, based on the voltage island and dynamic voltage and frequency scaling (DVFS) techniques, we investigate the energy efficiency of block-partitioned multicore processors, where cores are grouped into blocks and the cores on one block share a DVFS-enabled power supply. Depending on the number of cores on each block, we study both symmetric and asymmetric block configurations. We develop a system-level power model (which can support various power management techniques) and derive both block- and system-wide energy-efficient frequencies for systems with block-partitioned multicore processors. Based on the power model, we prove that, for embarrassingly parallel applications, having all cores on a single block can achieve the same energy savings as the individual block configuration (where each core forms a single block and has its own power supply). For applications with limited degrees of parallelism, however, we show the superiority of the buddy-asymmetric block configuration, in which the number of required blocks (and power supplies) is logarithmically related to the number of cores on the chip: it can achieve the same energy savings as the individual block configuration. The energy efficiency of different block configurations is further evaluated through extensive simulations with both synthetic applications and a real-life application.
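The notion of an "energy-efficient frequency" can be illustrated with a generic single-core power model. The sketch below is an assumption, not the paper's system-level model: it takes active power as P(f) = p_static + c_eff * f^m in normalized units, so running a fixed number of cycles costs E(f) = P(f) * cycles / f, and setting dE/df = 0 gives f_ee = (p_static / (c_eff * (m - 1)))^(1/m). The names `p_static`, `c_eff`, and `m` are illustrative.

```python
def energy(freq, cycles, p_static, c_eff, m):
    """Energy to execute `cycles` cycles at `freq` under the assumed model
    P(f) = p_static + c_eff * f**m (normalized units)."""
    exec_time = cycles / freq
    return (p_static + c_eff * freq ** m) * exec_time


def energy_efficient_frequency(p_static, c_eff, m):
    """Frequency minimizing energy per cycle for the assumed model (requires m > 1)."""
    if m <= 1:
        raise ValueError("the closed form assumes m > 1")
    return (p_static / (c_eff * (m - 1))) ** (1.0 / m)


if __name__ == "__main__":
    # Hypothetical parameters: 10% frequency-independent power, cubic dynamic power.
    f_ee = energy_efficient_frequency(p_static=0.1, c_eff=1.0, m=3)
    print(f"energy-efficient frequency: {f_ee:.4f}")
    for f in (0.5 * f_ee, f_ee, 2.0 * f_ee):
        print(f"f = {f:.4f}  energy = {energy(f, 1.0, 0.1, 1.0, 3):.4f}")
```

Running below f_ee stretches execution time so that frequency-independent power dominates, while running above it pays the superlinear dynamic-power cost; this trade-off is why slowing down as far as possible is not always the most energy-efficient choice.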
Abstract: Multicore systems now use dynamic voltage/frequency scaling (DV/FS) to let individual cores operate at voltages and/or frequencies different from those of other cores, saving power and improving performance. In this paper, an effective and reliable hybrid model that reduces energy and makespan in multicore systems is proposed. The hybrid model enhances the greedy approach and integrates it with dynamic programming to reach optimal voltage/frequency (Vmin/F) levels. The allocation process is then applied based on the available workloads. The model consists of three stages: the first finds the optimum safe voltage, the second sets the level of energy efficiency, and the third performs the allocation. Experimental results on various benchmarks show that the proposed model can generate optimal solutions that save energy while minimizing the makespan penalty. Comparisons with other competitive algorithms show that the proposed model provides, on average, a 48% improvement in energy saving and an 18% reduction in computation time while ensuring a high degree of system reliability.
Abstract: Simulation is an important and useful technique that helps users understand and model real-life systems. Once built, the models can be run to provide realistic results, which supports decision-making on a more logical and scientific basis. The paper introduces the method of simulation and describes various types of its application. The authors analyzed the creation and implementation of the program code and compared parallel instruction execution with pipelined instruction execution. The power of simulation is that a common model can be used to design a large variety of systems. An important aspect of the simulation method is that a simulation model is designed to be reproduced on actual computer systems, especially multicore processors. For this reason, it is important to minimize the average waiting time of instructions in the fetch and decode stages. The objective of the research is to prove that parallel execution of program code is faster than sequential execution on a multiprocessor architecture. System modeling that uses simulation methods on parallel computer systems is very precise. The time benefit gained when simulating the mathematical model on the pipelined processor is higher than that gained when simulating it on the multiprocessor computer system.
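For intuition about why pipelining shortens execution, the textbook cycle counts for an ideal pipeline can be sketched as follows. This is an idealized model with no stalls or hazards and one cycle per stage, not the simulation used in the paper; the function names are hypothetical.

```python
def sequential_cycles(n_instructions, n_stages):
    """Cycles when each instruction passes through all stages before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """Cycles on an ideal pipeline: fill the pipeline once, then retire one per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def pipeline_speedup(n_instructions, n_stages):
    """Ideal speedup of pipelined over sequential execution (needs n_instructions > 0)."""
    return sequential_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, sequential_cycles(n, 5), pipelined_cycles(n, 5),
              round(pipeline_speedup(n, 5), 2))
```

As the instruction count grows, the ideal speedup approaches the number of stages; real pipelines fall short of this because of stalls in stages such as fetch and decode.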
Funding: Supported by the National High-Tech Research and Development 863 Program of China under Grant Nos. 2008AA010901, 2009AA01Z125, and 2009AA01Z103; the National Natural Science Foundation of China under Grant Nos. 60736012, 60921002, 60803029, and 61050002; the National Basic Research 973 Program of China under Grant No. 2005CB321600; and the Important National Science and Technology Specific Projects under Grant Nos. 2009ZX01028-002-003 and 2009ZX01029-001-003.
Abstract: This paper describes the design for testability (DFT) challenges and techniques of the Godson-3 microprocessor, a scalable multicore processor that is based on the scalable mesh of crossbar (SMOC) on-chip network and targets high-end applications. Advanced techniques are adopted to make the DFT design scalable and to achieve low-power, low-cost test with limited IO resources. To provide scalable and flexible test access, a highly elaborate test access mechanism (TAM) is implemented that supports multiple test instructions and test modes. Taking advantage of the multiple identical cores embedded in the processor, scan partitioning and on-chip comparison are employed to reduce test power and test time. Test compression is also used to decrease test time. To further reduce test power, the clock control logic is designed so that it can turn off the clocks of partitions that are not under test. In addition, scan collars for the caches are designed to perform functional tests with low-speed ATE for speed-binning purposes; this approach has low complexity and gives good correlation results.
Abstract: As multi-core processors become the de facto configuration in modern computers, the adoption of SMP virtual machines (VMs) has been increasing, allowing for more efficient use of computing resources. However, the existence of schedulers in both the hypervisor and the guest VMs creates a new research problem, viz., double scheduling. Although double scheduling may cause many issues, including lock-holder preemption, vCPU stacking, CPU fragmentation, and priority inversion, prior approaches have either introduced new problems or addressed the problem incompletely. In this paper, we describe the design and implementation of FlexCore, a new scheduling scheme using vCPU ballooning, which dynamically adjusts the number of vCPUs of a VM at runtime. This essentially eliminates unnecessary scheduling in the hypervisor layer and thus boosts performance significantly. An evaluation using a complete KVM-based implementation shows that the average performance improvement for PARSEC applications on a 12-core Intel machine is approximately 52.9%, ranging from 35.4% to 79.6%.
Funding: This work is supported by the National Natural Science Foundation of China under Grant Nos. 61572470, 61532017, 61522406, 61432017, 61376043, and 61221062.
Abstract: The load power range of modern processors is greatly enlarged because many advanced power management techniques are employed, such as dynamic voltage frequency scaling, Turbo Boosting, and near-threshold voltage (NTV) technologies. However, because the efficiency of power delivery varies greatly with load conditions, conventional power delivery designs cannot maintain high efficiency over the entire voltage spectrum, and the power saving gained may be offset by power loss in power delivery. We propose SuperRange, a power delivery unit with a wide operational range. SuperRange combines the complementary power delivery capabilities of an on-chip voltage regulator and an off-chip voltage regulator. On top of SuperRange, we analyze its power conversion characteristics and propose a voltage regulator (VR) aware power management algorithm. Moreover, as more and more cores are integrated on a single chip, multiple SuperRange units can serve as basic building blocks for building, in a highly scalable way, a more powerful power delivery subsystem with larger power capacity. Experimental results show that a SuperRange unit offers 1x and 1.3x higher power conversion efficiency (PCE) than two other conventional power delivery schemes in the NTV region and exhibits an average PCE of 70% over the entire operational range. It also exhibits superior resilience in power-constrained systems.