Funding: the National Key Research and Development Program of China under Grant 2018YFB1802100; the Major Key Project of PCL (PCL2021A01-2).
Abstract: This paper presents an E-band frequency quadrupler in 40-nm CMOS technology. The circuit employs two push-push frequency doublers and two single-stage neutralized amplifiers. The pseudo-differential class-B biased cascode topology is adopted for the frequency doubler, which improves the reverse isolation and the conversion gain. A neutralization technique is applied to increase the stability and the power gain of the amplifiers simultaneously. Stacked transformers are used for single-ended-to-differential transformation as well as output bandpass filtering. The output bandpass filter enhances the 4th-harmonic output power while rejecting the undesired harmonics, especially the 2nd harmonic. The core chip is 0.23 mm² in size and consumes 34 mW. The measured 4th harmonic achieves a maximum output power of 1.7 dBm with a peak conversion gain of 3.4 dB at 76 GHz. Fundamental and 2nd-harmonic suppressions of over 45 dB and 20 dB, respectively, are achieved for the spectrum from 74 to 82 GHz.
Funding: supported jointly by the National Key Research and Development Program of China (No. 2022YFB4500303) and the National Natural Science Foundation of China (NSFC) (Grant Nos. 62072198, 61832006, 61825202, and 61929103).
Abstract: Hybrid memory systems composed of dynamic random access memory (DRAM) and non-volatile memory (NVM) often exploit page migration technologies to take full advantage of the different memory media. Most previous proposals migrate data at a granularity of 4 KB pages, and thus waste memory bandwidth and DRAM resources. In this paper, we propose Mocha, a non-hierarchical architecture that organizes DRAM and NVM in a flat physical address space but manages them as a cache/memory hierarchy. Since the commercial NVM device, the Intel Optane DC Persistent Memory Module (DCPMM), actually accesses the physical media at a granularity of 256 bytes (an Optane block), we manage the DRAM cache at a 256-byte granularity to match this feature of Optane. This design not only enables fine-grained data migration and management for the DRAM cache, but also avoids write amplification for Intel Optane DCPMM. We also create an Indirect Address Cache (IAC) in the Hybrid Memory Controller (HMC) and propose a reverse address mapping table in the DRAM to speed up address translation and cache replacement. Moreover, we exploit a utility-based caching mechanism to filter cold blocks in the NVM and further improve the efficiency of the DRAM cache. We implement Mocha in an architectural simulator. Experimental results show that Mocha improves application performance by 8.2% on average (up to 24.6%), and reduces energy consumption by 6.9% and data migration traffic by 25.9% on average, compared with HSCC, a typical hybrid memory architecture.
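For intuition, the following C++ sketch models a 256-byte-block DRAM cache whose slots record the NVM block they hold, which is the role a reverse address mapping table plays during replacement; the class names, the LRU policy, and the in-memory index standing in for the Indirect Address Cache are illustrative assumptions, not Mocha's actual design.

```cpp
// Hypothetical sketch of a 256-byte-block DRAM cache with a reverse mapping,
// loosely modeled on the ideas described in the abstract. Illustrative only.
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr uint64_t kBlockSize = 256;   // Optane block granularity

struct CacheSlot {
    uint64_t nvm_block = UINT64_MAX;   // which NVM block currently lives here
    bool     dirty     = false;
    uint64_t last_use  = 0;            // crude recency counter for replacement
};

class BlockCache {
public:
    explicit BlockCache(size_t slots) : slots_(slots) {}

    // Translate an NVM physical address to a DRAM slot, if cached.
    // The in-memory index below stands in for the Indirect Address Cache.
    bool lookup(uint64_t nvm_addr, size_t& slot_out) {
        uint64_t block = nvm_addr / kBlockSize;
        auto it = index_.find(block);
        if (it == index_.end()) return false;
        slot_out = it->second;
        slots_[slot_out].last_use = ++clock_;
        return true;
    }

    // Install a block, evicting the least recently used slot if needed.
    size_t install(uint64_t nvm_addr) {
        uint64_t block = nvm_addr / kBlockSize;
        size_t victim = pick_lru();
        // Reverse mapping: the slot itself records its NVM block, so the
        // victim's index entry can be removed without scanning the table.
        if (slots_[victim].nvm_block != UINT64_MAX)
            index_.erase(slots_[victim].nvm_block);
        slots_[victim].nvm_block = block;
        slots_[victim].dirty = false;
        slots_[victim].last_use = ++clock_;
        index_[block] = victim;
        return victim;
    }

private:
    size_t pick_lru() {
        size_t best = 0;
        for (size_t i = 1; i < slots_.size(); ++i)
            if (slots_[i].last_use < slots_[best].last_use) best = i;
        return best;
    }

    std::vector<CacheSlot> slots_;
    std::unordered_map<uint64_t, size_t> index_;   // NVM block -> DRAM slot
    uint64_t clock_ = 0;
};
```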
Funding: National Natural Science Foundation of China (Grant Nos. 61972444, 61825202, 62072195, and 61832006); Zhejiang Lab (2022P10AC02).
Abstract: Graphs, which model real-world entities as vertices and the relationships among entities as edges, have proven to be a powerful tool for describing real-world problems. In most real-world scenarios, entities and their relationships are subject to constant change; graphs that record such changes are called dynamic graphs. In recent years, the widespread application scenarios of dynamic graphs have stimulated extensive research on dynamic graph processing systems, which continuously ingest graph updates and produce up-to-date graph analytics results. As dynamic graphs grow larger, higher performance is demanded of dynamic graph processing systems. With their massive parallel processing power and high memory bandwidth, GPUs have become mainstream vehicles for accelerating dynamic graph processing tasks. GPU-based dynamic graph processing systems mainly address two challenges: maintaining the graph data when updates occur (i.e., graph updating) and producing analytics results in time (i.e., graph computing). In this paper, we survey GPU-based dynamic graph processing systems and review their methods for addressing both graph updating and graph computing. To discuss existing dynamic graph processing systems on GPUs comprehensively, we first introduce the terminology of dynamic graph processing and then develop a taxonomy describing the methods employed for graph updating and graph computing. In addition, we discuss the challenges and future research directions of dynamic graph processing on GPUs.
Abstract: ATPG (automatic test pattern generation) is a critical technique in VLSI (very large scale integration) circuit testing, and its quality directly affects test cost and overhead. However, existing parallel ATPG methods generally suffer from load imbalance, limited parallelization strategies, high storage overhead, and poor data locality. Given the high parallelism and scalability of graph computing, fast, efficient, low-storage-overhead, and highly scalable graph processing systems may be an important tool for effectively supporting ATPG, which is especially valuable for reducing test cost. This paper explores the application of graph computing to combinational ATPG: it introduces how graph computing models can transform ATPG algorithms into graph algorithms, analyzes the challenges of applying existing graph processing systems to ATPG, and proposes a single-machine graph processing system for ATPG. It then discusses the challenges and future research directions of supporting ATPG with graph processing systems from the perspectives of optimization on traditional architectures, acceleration with emerging hardware, and optimization based on emerging memory devices.
Funding: supported by the National Natural Science Foundation of China (Grant Nos. 62072198, 61732010, 61825202, and 62032008).
Abstract: Unikernels provide an efficient and lightweight way to deploy cloud computing services in application-specialized, single-address-space virtual machines (VMs). Hundreds of unikernel-based VMs can be deployed efficiently on a single physical server. In such a cloud computing platform, main memory is the primary bottleneck resource for high-density application deployment. Recently, non-volatile memory (NVM) technologies have become increasingly popular in cloud data centers because they offer extremely large memory capacity at low cost. However, many challenges remain in utilizing NVMs for unikernel-based VMs, such as the difficulty of heterogeneous memory allocation and the high performance overhead of address translation. In this paper, we present UCat, a heterogeneous memory management mechanism that supports multi-grained memory allocation for unikernels. We propose front-end/back-end cooperative address space mapping to expose the host memory heterogeneity to unikernels. UCat exploits large pages to reduce the cost of two-layer address translation in virtualized environments, and leverages slab allocation to reduce memory waste due to internal fragmentation. We implement UCat based on a popular unikernel, OSv, and conduct extensive experiments to evaluate its efficiency. Experimental results show that UCat reduces the memory consumption of unikernels by 50% and the TLB miss rate by 41%, and improves the throughput of real-world benchmarks such as memslap and YCSB by up to 18.5% and 14.8%, respectively.
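As a minimal illustration of the slab-allocation idea, the C++ sketch below carves fixed-size objects out of a region assumed to be backed by one large page; the class and its interface are hypothetical and are not UCat's allocator.

```cpp
// Minimal slab-allocator sketch: a large-page-backed region is divided into
// fixed-size objects, so small allocations avoid per-object fragmentation.
// Illustrative only; not UCat's implementation.
#include <cstddef>
#include <cstdint>
#include <vector>

class Slab {
public:
    // `base` points to memory assumed to be backed by one large page (e.g., 2 MiB).
    Slab(void* base, size_t region_size, size_t obj_size)
        : base_(static_cast<uint8_t*>(base)), obj_size_(obj_size) {
        size_t count = region_size / obj_size;
        free_list_.reserve(count);
        for (size_t i = 0; i < count; ++i)
            free_list_.push_back(base_ + i * obj_size);   // all slots start free
    }

    void* alloc() {
        if (free_list_.empty()) return nullptr;           // slab full
        void* p = free_list_.back();
        free_list_.pop_back();
        return p;
    }

    void free(void* p) { free_list_.push_back(static_cast<uint8_t*>(p)); }

private:
    uint8_t* base_;
    size_t obj_size_;
    std::vector<uint8_t*> free_list_;
};
```

Because every object served from the slab shares the same large-page mapping, small allocations avoid both the fragmentation of handing out whole pages and additional translation entries, which is the combination of benefits the abstract attributes to pairing large pages with slab allocation.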
Funding: supported by the National Natural Science Foundation of China (Grant Nos. 61832006, 62072195, and 61825202).
Abstract: With the increasing amount of data, there is an urgent need for efficient sorting algorithms to process large data sets. Hardware sorting algorithms have attracted much attention because they can exploit the parallelism of different hardware. However, traditional hardware sort accelerators suffer from the "memory wall" problem because of their multiple rounds of data transfer between memory and the processor. In this paper, we utilize the in-situ processing ability of the ReRAM crossbar to design a new ReCAM array that can perform matrix-vector multiplication and vector-scalar comparison in the same array simultaneously. Using this ReCAM array, we present ReCSA, the first dedicated ReCAM-based sort accelerator. Besides the hardware design, we also develop algorithms that maximize memory utilization and minimize memory exchanges to improve sorting performance. The sorting algorithm in ReCSA can process various data types, such as integers, floats, doubles, and strings. We also present experiments that evaluate performance and energy efficiency against state-of-the-art sort accelerators. The experimental results show that ReCSA achieves speedups of 90.92×, 46.13×, 27.38×, 84.57×, and 3.36× over CPU-, GPU-, FPGA-, NDP-, and PIM-based platforms, respectively, when processing numeric data sets, and performance improvements of 24.82×, 32.94×, and 18.22× over CPU-, GPU-, and FPGA-based platforms when processing string data sets.
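To see how matrix-vector multiplication and parallel comparison can relate to sorting, the plain C++ rank-sort sketch below serves as a software analogue: a comparison matrix multiplied by an all-ones vector yields each element's rank. This is only an illustration of the general idea and is not the algorithm ReCSA runs on the ReCAM array.

```cpp
// Rank sort expressed as comparisons plus a matrix-vector product:
// C[i][j] = 1 if a[j] should precede a[i]; the rank of a[i] is the dot
// product of row i with an all-ones vector. In a crossbar the comparisons
// and the accumulation could be evaluated in parallel in-array; here they
// are ordinary loops. Software analogue only.
#include <cstddef>
#include <vector>

std::vector<int> rank_sort(const std::vector<int>& a) {
    const size_t n = a.size();
    std::vector<size_t> rank(n, 0);
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            // Ties broken by index so every element gets a unique rank.
            if (a[j] < a[i] || (a[j] == a[i] && j < i))
                ++rank[i];                   // row-i dot all-ones vector

    std::vector<int> out(n);
    for (size_t i = 0; i < n; ++i)
        out[rank[i]] = a[i];                 // scatter each element by rank
    return out;
}
```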
Funding: supported by the Natural Science Foundation Key Projects of Guangxi [grant number 2018GXNSFDA050021]; the "139" Talent Cultivation Plan Program of Guangxi Medical High-level Backbone; the National Natural Science Foundation of China [grant number 82160479]; the CSCO Youth Innovative Oncology Research Fund [grant number Y-Young2020-0520]; and the Scientific Research Program of the Health Commission of Guangxi Zhuang Autonomous Region [grant numbers Z20201200, Z20170816, and Z20170207].
Abstract: Purpose: To investigate the methylation status and expression level of G protein-coupled receptor 135 (GPR135) in nasopharyngeal carcinoma (NPC) and determine its prognostic value. Methods: The GPR135 methylation data of NPC and normal nasopharyngeal tissues were obtained from the Gene Expression Omnibus (GEO) GSE52068 dataset. The GPR135 promoter region methylation level in four normal nasopharyngeal epithelial tissues and eight NPC tissues was detected by bisulfite sequencing. GPR135 expression in NPC and normal nasopharyngeal tissue was obtained from the GEO GSE13597 dataset. GPR135 mRNA expression levels in 13 NPC and 26 healthy control tissues were assessed with quantitative real-time PCR (qRT-PCR). The GPR135 expression level in 124 NPC tissue sections was analyzed by immunohistochemistry. The correlation between GPR135 expression and clinicopathological features was analyzed by a chi-square test. GPR135 expression in patients with NPC was evaluated by immunohistochemistry, and its influence on prognosis was assessed by Kaplan-Meier and Cox regression analyses. Results: Bisulfite sequencing demonstrated that the GPR135 promoter region was highly methylated in NPC tissues. The immunohistochemistry results revealed that patients with high GPR135 expression had better overall survival (hazard ratio [HR] = 0.177, 95% confidence interval [95% CI]: 0.072–0.437, P = 0.008), disease-free survival (HR = 0.4401, 95% CI: 0.222–0.871, P = 0.034), and local recurrence-free survival (HR = 0.307, 95% CI: 0.119–0.790, P = 0.046) than those with low GPR135 expression. Conclusion: GPR135 is hypermethylated in NPC, and high GPR135 expression indicates a favorable prognosis. Therefore, GPR135 might serve as a prognostic indicator.
Funding: the National Natural Science Foundation of China (Grant No. 61702202); the China Postdoctoral Science Foundation Funded Project (2017M610477 and 2017T100555).
Abstract: Although many graph processing systems have been proposed, graphs in the real world are often dynamic, and it is important to keep the results of graph computation up to date. Incremental computation has been demonstrated to be an efficient way to update previously computed results. Recently, many incremental graph processing systems have been proposed to handle dynamic graphs asynchronously, and they achieve better performance than synchronous ones. However, these solutions still suffer from sub-optimal convergence speed due to their slow propagation of important vertex state (important to convergence speed) and poor locality. To solve these problems, we propose a novel graph processing framework. It introduces a dynamic partition method that gathers the important vertices for high locality, and then uses a priority-based scheduling algorithm to assign them a higher priority for an effective processing order. By these means, it reduces the number of updates and increases locality, thereby reducing the convergence time. Experimental results show that our method reduces the number of updates by 30% and the total execution time by 35%, compared with state-of-the-art systems.
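A hypothetical sketch of priority-based vertex scheduling is shown below; the priority metric (the magnitude of a vertex's pending delta), the PageRank-style propagation, and the data structures are assumptions made for illustration and do not describe the proposed framework's implementation.

```cpp
// Priority-based vertex scheduling sketch: vertices with larger pending
// state changes are processed first, so "important" updates propagate early
// and the total number of updates drops. Illustrative assumptions only.
#include <cmath>
#include <queue>
#include <vector>

struct Task {
    int    vertex;
    double priority;                       // e.g., |delta| of pending state
    bool operator<(const Task& o) const { return priority < o.priority; }
};

// graph[v] lists v's out-neighbors; delta[v] holds v's pending change.
void process(const std::vector<std::vector<int>>& graph,
             std::vector<double>& state,
             std::vector<double>& delta,
             double epsilon) {
    std::priority_queue<Task> pq;
    for (int v = 0; v < static_cast<int>(graph.size()); ++v)
        if (std::fabs(delta[v]) > epsilon) pq.push({v, std::fabs(delta[v])});

    while (!pq.empty()) {
        int v = pq.top().vertex;
        pq.pop();
        double d = delta[v];
        if (std::fabs(d) <= epsilon) continue;   // stale entry, already applied
        state[v] += d;
        delta[v] = 0.0;
        if (graph[v].empty()) continue;          // no out-neighbors to notify
        // Propagate the change to out-neighbors (PageRank-style accumulation).
        double share = d * 0.85 / static_cast<double>(graph[v].size());
        for (int u : graph[v]) {
            delta[u] += share;
            if (std::fabs(delta[u]) > epsilon) pq.push({u, std::fabs(delta[u])});
        }
    }
}
```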
Funding: The authors would like to thank the anonymous reviewers for their insightful comments. This work was supported jointly by the National Key Research and Development Program of China (2017YFB1001603) and the National Natural Science Foundation of China (NSFC) (Grant Nos. 61672251, 61732010, and 61825202).
Abstract: Emerging byte-addressable non-volatile memory (NVM) technologies offer higher density and lower cost than DRAM, at the expense of lower performance and limited write endurance. There have been many studies on hybrid NVM/DRAM memory management in a single physical server, but how to manage hybrid memories efficiently in a distributed environment remains an open problem. This paper proposes Alloy, a memory resource abstraction and data placement strategy for an RDMA-enabled distributed hybrid memory pool (DHMP). Alloy provides simple APIs for applications to utilize DRAM or NVM resources in the DHMP without being aware of the DHMP's hardware details. We propose a hotness-aware data placement scheme that combines hot data migration, data replication, and write merging to improve application performance and reduce the cost of DRAM. We evaluate Alloy with several micro-benchmark workloads and public benchmark workloads. Experimental results show that Alloy can reduce DRAM usage in the DHMP by up to 95%, while reducing the total memory access time by up to 57% compared with state-of-the-art approaches.
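The abstract mentions simple APIs that hide the DHMP's hardware details; the C++ sketch below imagines what such a client-facing interface might look like, backed by a trivial in-process mock so it compiles on its own. All names (HybridMemoryPool, MemClass, alloc, and so on) are invented for illustration and are not Alloy's real API.

```cpp
// Hypothetical client-facing interface for a distributed hybrid memory pool,
// backed here by a trivial in-process mock so the sketch is self-contained.
// Names are invented for illustration; this is not Alloy's real API.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

enum class MemClass { kAuto, kDram, kNvm };    // kAuto: let the pool decide

class HybridMemoryPool {
public:
    using Handle = uint64_t;                   // opaque handle into the pool

    // With kAuto, a hotness-aware policy could place hot objects in DRAM and
    // cold ones in NVM; the mock merely records the hint.
    Handle alloc(size_t size, MemClass hint = MemClass::kAuto) {
        Handle h = next_++;
        objects_[h] = Object{std::vector<uint8_t>(size), hint};
        return h;
    }

    void write(Handle h, const void* src, size_t size, size_t offset = 0) {
        std::memcpy(objects_[h].bytes.data() + offset, src, size);
    }

    void read(Handle h, void* dst, size_t size, size_t offset = 0) {
        std::memcpy(dst, objects_[h].bytes.data() + offset, size);
    }

    void free(Handle h) { objects_.erase(h); }

private:
    struct Object { std::vector<uint8_t> bytes; MemClass placement; };
    std::unordered_map<Handle, Object> objects_;
    Handle next_ = 1;
};
```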
Abstract: Storage class memory (SCM) has the potential to revolutionize the memory landscape with its non-volatile and byte-addressable properties. However, there is little published work exploring its use in modern virtualized cloud infrastructure. We propose SCM-vWrite, a novel architecture designed around SCM, to ease the performance interference of the virtualized storage subsystem. Through a case study on a typical virtualized cloud system, we first describe why current writeback mechanisms are not suitable for a virtualized environment, and then design and implement SCM-vWrite to address this problem. We also use typical benchmarks and realistic workloads to evaluate its performance. Compared with the traditional method on a conventional architecture, the experimental results show that SCM-vWrite coordinates writeback flows more effectively among multiple co-located guest operating systems, achieving better disk I/O performance without any loss of reliability.
Funding: the National High-Tech Research and Development Program of China (2015AA015303); the National Natural Science Foundation of China (Grant No. 61732010).
Abstract: Despite the growing popularity of task-based parallel programming, today's task-parallel programming libraries and languages still provide limited support for coordinating parallel tasks. This limitation forces programmers to use additional independent components to coordinate the parallel tasks; these components can be third-party libraries or additional components in the same programming library or language. Moreover, mixing tasks and coordination components increases the difficulty of task-based programming and blinds schedulers to tasks' dependencies. In this paper, we propose a task-based parallel programming library, FunctionFlow, which coordinates tasks so as to avoid additional independent coordination components. First, we use dependency expressions to represent tasks' termination: the key idea is to use && to wait for both tasks' termination and || to wait for any task's termination, along with combinations of dependency expressions. Second, as runtime support, we use a lightweight representation for dependency expressions and a suspended-task queue to schedule tasks that still have unmet prerequisites. Finally, we demonstrate FunctionFlow's effectiveness in two aspects: a case study on implementing popular parallel patterns with FunctionFlow, and a performance comparison with the state-of-the-art practice, TBB. Our demonstration shows that FunctionFlow can generally coordinate parallel tasks without involving additional components, while delivering performance comparable to TBB.
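A rough C++ sketch of the dependency-expression idea follows, using std::shared_future to stand in for tasks and overloading && and || to express both-termination and any-termination; these operators and helpers are illustrative assumptions, not FunctionFlow's actual API.

```cpp
// Sketch of dependency expressions over tasks: `a && b` yields a dependency
// satisfied when both tasks finish, `a || b` when either finishes.
// std::shared_future serves as the task handle. Illustrative only.
#include <chrono>
#include <cstdio>
#include <future>

using Dep = std::shared_future<void>;

// Both-termination: completes after a and b have both finished.
Dep operator&&(Dep a, Dep b) {
    return std::async(std::launch::async, [a, b] { a.wait(); b.wait(); }).share();
}

// Any-termination: completes as soon as either a or b has finished.
Dep operator||(Dep a, Dep b) {
    return std::async(std::launch::async, [a, b] {
        while (true) {
            if (a.wait_for(std::chrono::milliseconds(1)) == std::future_status::ready) return;
            if (b.wait_for(std::chrono::milliseconds(1)) == std::future_status::ready) return;
        }
    }).share();
}

Dep spawn(void (*fn)()) { return std::async(std::launch::async, fn).share(); }

void work_a() { std::puts("A done"); }
void work_b() { std::puts("B done"); }
void work_c() { std::puts("C done"); }

int main() {
    Dep a = spawn(work_a), b = spawn(work_b), c = spawn(work_c);
    // Run the follow-up work once (a && b) || c is satisfied.
    Dep ready = (a && b) || c;
    ready.wait();
    std::puts("dependency satisfied; launching dependent task");
}
```

Writing the dependency directly as (a && b) || c keeps the coordination inside the expression itself, rather than delegating it to a separate synchronization component, which is the property the library aims for.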
Abstract: Page migration has long been adopted in hybrid memory systems comprising dynamic random access memory (DRAM) and non-volatile memories (NVMs) to improve system performance and energy efficiency. However, page migration introduces side effects, such as more translation lookaside buffer (TLB) misses, broken memory contiguity, and extra memory accesses due to page table updates. In this paper, we propose a superpage-friendly page table called SuperPT to reduce the performance overhead of serving TLB misses. By leveraging a virtual hashed page table and a hybrid DRAM allocator, SuperPT performs address translations in a flexible and efficient way while still retaining the contiguity of the migrated pages.
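As a rough illustration of the hashed-page-table idea, the following C++ sketch resolves a virtual page number through a hash-indexed bucket of (VPN, PFN) entries; the layout, hash function, and naming are assumptions for illustration and not SuperPT's actual structure.

```cpp
// Hashed page table sketch: virtual page numbers are hashed into buckets,
// each bucket holding a short chain of (VPN -> PFN) entries, so a lookup
// touches one bucket instead of walking a multi-level radix tree.
// Illustrative only; not SuperPT's actual layout.
#include <cstdint>
#include <optional>
#include <vector>

struct HashedPageTable {
    struct Entry { uint64_t vpn; uint64_t pfn; };

    explicit HashedPageTable(size_t buckets) : table_(buckets) {}

    void map(uint64_t vpn, uint64_t pfn) {
        table_[hash(vpn)].push_back({vpn, pfn});
    }

    // Translate a virtual page number; returns empty on a miss (page fault).
    std::optional<uint64_t> translate(uint64_t vpn) const {
        for (const Entry& e : table_[hash(vpn)])
            if (e.vpn == vpn) return e.pfn;
        return std::nullopt;
    }

private:
    size_t hash(uint64_t vpn) const {
        return (vpn * 0x9E3779B97F4A7C15ULL) % table_.size();
    }
    std::vector<std::vector<Entry>> table_;
};
```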