Energy-proportional computing is one of the foremost constraints in the design of next generation exascale systems. These systems must have a very high FLOP-per-watt ratio to be sustainable, which requires tremendous ...Energy-proportional computing is one of the foremost constraints in the design of next generation exascale systems. These systems must have a very high FLOP-per-watt ratio to be sustainable, which requires tremendous improvements in power efficiency for modern computing systems. This paper focuses on the processor—as still the biggest contributor to the power usage—by considering both its core and uncore power subsystems. The uncore describes those processor functions that are not handled by the core, such as L3 cache and on-chip interconnect, and contributes significantly to the total system power. The uncore frequency scaling (UFS) capability has been available to the user since the Intel Haswell processor generation. In this paper, performance and power models are proposed to use both the UFS and dynamic voltage and frequency scaling (DVFS) to reduce the energy consumption in parallel applications. Then, these models are incorporated into a runtime strategy that performs processor frequency scaling during parallel application execution. The strategy can be implemented at the kernel/firmware level, which makes it suitable for improving the energy efficiency of exascale design. Experiments on a 20-core Haswell-EP machine using the quantum chemistry application GAMESS and NAS benchmark resulted in up to 24% energy savings with as little as 2% performance loss.展开更多
安卓系统为浏览器分配资源时无法感知网页内容,会导致资源过度分配和电量不必要损失。同时,由于CPU可调节频率密度的增长,通过动态电压频率缩放(dynamic voltage and frequency scaling, DVFS)技术实现能耗优化的难度也随之增大。另外...安卓系统为浏览器分配资源时无法感知网页内容,会导致资源过度分配和电量不必要损失。同时,由于CPU可调节频率密度的增长,通过动态电压频率缩放(dynamic voltage and frequency scaling, DVFS)技术实现能耗优化的难度也随之增大。另外在系统默认的调控策略下,忽视了图形处理器(graphics processing unit, GPU)对浏览器运行的作用。针对上述问题,提出一种协同调控CPU和GPU实现功耗优化的方法。首先根据网页加载时处理器运行特征利用逻辑回归对网页进行分类,对网页特征加权实现复杂度量化,根据类别与复杂度采用DVFS技术限制CPU频率的同时调节GPU频率。该方法被应用于谷歌Pixel2 XL上的Chromium浏览器,对排名前500的中文网站进行测试,平均节省了12%功耗的同时减少了5%网页加载时间。展开更多
As low power consumption is the main design issue involved in a network on chip (NoC), researchers are concentrating more on both algorithms and architectural approaches. The conventional Dynamic Frequency Scalin...As low power consumption is the main design issue involved in a network on chip (NoC), researchers are concentrating more on both algorithms and architectural approaches. The conventional Dynamic Frequency Scaling (DFS) and history based Frequency Scaling (HDFS) algorithms are utilized to process the energy constrained data traffic. However, these conventional algorithms achieve higher energy efficiencies, and they result in performance degradation due to the auxiliary latency between clock domains. In this paper, we present a variable power optimization interface for NoC using a Finite State Machine (FSM) approach to attain better performance improvement. The parameters are estimated using 45 nm TSMCCMOS technology. In comparison with DFS system, the evaluation results show that FSM-DFS link achieves 81.55% dynamic power savings on the links in the on-chip network, and 37.5% leakage power savings of the link. Also, this proposed work is evaluated for various performance parameters and compared with conventional work. The simulation results are superior to conventional work.展开更多
To apply a quasi-cyclic low density parity check(QC-LDPC)to different scenarios,a data-stream driven pipelined macro instruction set and a reconfigurable processor architecture are proposed for the typical QC-LDPC alg...To apply a quasi-cyclic low density parity check(QC-LDPC)to different scenarios,a data-stream driven pipelined macro instruction set and a reconfigurable processor architecture are proposed for the typical QC-LDPC algorithm.The data-level parallelism is improved by instructions to dynamically configure the multi-core computing units.Simultaneously,an intelligent adjustment strategy based on a programmable wake-up controller(WuC)is designed so that the computing mode,operating voltage,and frequency of the QC-LDPC algorithm can be adjusted.This adjustment can improve the computing efficiency of the processor.The QC-LDPC processors are verified on the Xilinx ZCU102 field programmable gate array(FPGA)board and the computing efficiency is measured.The experimental results indicate that the QC-LDPC processor can support two encoding lengths of three typical QC-LDPC algorithms and 20 adaptive operating modes of operating voltage and frequency.The maximum efficiency can reach up to 12.18 Gbit/(s·W),which is more flexible than existing state-of-the-art processors for QC-LDPC.展开更多
文摘Energy-proportional computing is one of the foremost constraints in the design of next generation exascale systems. These systems must have a very high FLOP-per-watt ratio to be sustainable, which requires tremendous improvements in power efficiency for modern computing systems. This paper focuses on the processor—as still the biggest contributor to the power usage—by considering both its core and uncore power subsystems. The uncore describes those processor functions that are not handled by the core, such as L3 cache and on-chip interconnect, and contributes significantly to the total system power. The uncore frequency scaling (UFS) capability has been available to the user since the Intel Haswell processor generation. In this paper, performance and power models are proposed to use both the UFS and dynamic voltage and frequency scaling (DVFS) to reduce the energy consumption in parallel applications. Then, these models are incorporated into a runtime strategy that performs processor frequency scaling during parallel application execution. The strategy can be implemented at the kernel/firmware level, which makes it suitable for improving the energy efficiency of exascale design. Experiments on a 20-core Haswell-EP machine using the quantum chemistry application GAMESS and NAS benchmark resulted in up to 24% energy savings with as little as 2% performance loss.
文摘安卓系统为浏览器分配资源时无法感知网页内容,会导致资源过度分配和电量不必要损失。同时,由于CPU可调节频率密度的增长,通过动态电压频率缩放(dynamic voltage and frequency scaling, DVFS)技术实现能耗优化的难度也随之增大。另外在系统默认的调控策略下,忽视了图形处理器(graphics processing unit, GPU)对浏览器运行的作用。针对上述问题,提出一种协同调控CPU和GPU实现功耗优化的方法。首先根据网页加载时处理器运行特征利用逻辑回归对网页进行分类,对网页特征加权实现复杂度量化,根据类别与复杂度采用DVFS技术限制CPU频率的同时调节GPU频率。该方法被应用于谷歌Pixel2 XL上的Chromium浏览器,对排名前500的中文网站进行测试,平均节省了12%功耗的同时减少了5%网页加载时间。
文摘As low power consumption is the main design issue involved in a network on chip (NoC), researchers are concentrating more on both algorithms and architectural approaches. The conventional Dynamic Frequency Scaling (DFS) and history based Frequency Scaling (HDFS) algorithms are utilized to process the energy constrained data traffic. However, these conventional algorithms achieve higher energy efficiencies, and they result in performance degradation due to the auxiliary latency between clock domains. In this paper, we present a variable power optimization interface for NoC using a Finite State Machine (FSM) approach to attain better performance improvement. The parameters are estimated using 45 nm TSMCCMOS technology. In comparison with DFS system, the evaluation results show that FSM-DFS link achieves 81.55% dynamic power savings on the links in the on-chip network, and 37.5% leakage power savings of the link. Also, this proposed work is evaluated for various performance parameters and compared with conventional work. The simulation results are superior to conventional work.
基金the National Key Research and Development Program of China(2019YFB1803600)the Key Scientific Research Program of Shaanxi Provincial Department of Education(22JY059)the China Civil Aviation Airworthiness Center Open Foundation(SH2021111903)。
文摘To apply a quasi-cyclic low density parity check(QC-LDPC)to different scenarios,a data-stream driven pipelined macro instruction set and a reconfigurable processor architecture are proposed for the typical QC-LDPC algorithm.The data-level parallelism is improved by instructions to dynamically configure the multi-core computing units.Simultaneously,an intelligent adjustment strategy based on a programmable wake-up controller(WuC)is designed so that the computing mode,operating voltage,and frequency of the QC-LDPC algorithm can be adjusted.This adjustment can improve the computing efficiency of the processor.The QC-LDPC processors are verified on the Xilinx ZCU102 field programmable gate array(FPGA)board and the computing efficiency is measured.The experimental results indicate that the QC-LDPC processor can support two encoding lengths of three typical QC-LDPC algorithms and 20 adaptive operating modes of operating voltage and frequency.The maximum efficiency can reach up to 12.18 Gbit/(s·W),which is more flexible than existing state-of-the-art processors for QC-LDPC.