Journal articles: 18 articles found
1. An Energy-Efficient Encoding and Decoding Mechanism for Memristor-Based Processing-in-Memory Architectures
Authors: Huang Yu, Zheng Long, Liu Haifeng, Qiu Qihang, Xin Jie, Liao Xiaofei, Jin Hai. Scientia Sinica Informationis (CSCD, PKU Core), 2024, No. 8, pp. 1827-1842 (16 pages)
In recent years, processing-in-memory architectures based on memristors have been widely studied to accelerate various applications, promising to break the memory-wall bottleneck of the von Neumann architecture. This paper observes an asymmetry in the energy consumption of memristor compute operations: operating on a memristor cell in the low-resistance state can cost several orders of magnitude more energy than operating on one in the high-resistance state. This offers an opportunity to save computation energy by reducing the number of low-resistance-state cells. To this end, this paper proposes a general and efficient encoding/decoding mechanism for memristors that can be seamlessly integrated into existing accelerators without affecting their computation results. On the encoding side, a subtraction-based encoder converts low-resistance states into high-resistance states and formulates the encoding problem as a graph-traversal problem to achieve optimal encodings. On the decoding side, a lightweight hardware decoder recovers the encoded computation results without introducing extra computation-time overhead. Experimental results show that the scheme performs well in several domains such as machine learning and graph processing, achieving energy savings of up to 31.3% and 56.0%, respectively.
Keywords: processing-in-memory; memristor; accelerator; energy efficiency; machine learning; graph processing
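The energy-saving idea behind the encoder can be sketched in software (a deliberately simplified illustration of minimizing low-resistance-state cells, not the paper's subtraction-based, graph-traversal encoder): store a word's bitwise complement plus a one-bit flag whenever the complement contains fewer 1s, i.e., fewer energy-expensive LRS cells.

```python
def encode(word: int, width: int = 8):
    """Store the complement when it has fewer 1s (LRS cells).

    Returns (stored_word, flipped_flag). A hypothetical simplification;
    the paper's encoder is more sophisticated.
    """
    mask = (1 << width) - 1
    flipped = word ^ mask
    if bin(flipped).count("1") < bin(word).count("1"):
        return flipped, True
    return word, False

def decode(stored: int, flag: bool, width: int = 8) -> int:
    """Invert flagged words to recover the original value."""
    mask = (1 << width) - 1
    return stored ^ mask if flag else stored

w = 0b11110111                       # 7 LRS cells
s, f = encode(w)                     # stored with far fewer LRS cells
assert decode(s, f) == w
assert bin(s).count("1") <= bin(w).count("1")
```

The one-bit flag per word is the storage price paid for the energy saving, mirroring the trade-off the abstract describes.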
2. An E-band CMOS frequency quadrupler with 1.7-dBm output power and 45-dB fundamental suppression
Authors: Xiaofei Liao, Dixian Zhao, Xiaohu You. Journal of Semiconductors (EI, CAS, CSCD), 2022, No. 9, pp. 46-52 (7 pages)
This paper presents an E-band frequency quadrupler in 40-nm CMOS technology. The circuit employs two push-push frequency doublers and two single-stage neutralized amplifiers. The pseudo-differential class-B biased cascode topology is adopted for the frequency doubler, which improves the reverse isolation and the conversion gain. The neutralization technique is applied to increase the stability and the power gain of the amplifiers simultaneously. Stacked transformers are used for single-ended-to-differential transformation as well as output bandpass filtering. The output bandpass filter enhances the 4th-harmonic output power while rejecting the undesired harmonics, especially the 2nd harmonic. The core chip is 0.23 mm² in size and consumes 34 mW. The measured 4th harmonic achieves a maximum output power of 1.7 dBm with a peak conversion gain of 3.4 dB at 76 GHz. Fundamental and 2nd-harmonic suppressions of over 45 and 20 dB, respectively, are achieved over the spectrum from 74 to 82 GHz.
Keywords: capacitor neutralization; CMOS; E-band; frequency doubler; frequency quadrupler; push-push
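The push-push principle behind the doubler stages follows from a one-line trigonometric identity, verified numerically below (a math illustration of frequency doubling only, unrelated to the paper's circuit details):

```python
import math

# cos^2(x) = 1/2 + 1/2*cos(2x): an even-order (squaring) nonlinearity,
# such as a push-push pair, maps a tone at f to a tone at 2f; cascading
# two doubler stages yields the 4th harmonic used by the quadrupler.
for x in [0.1, 0.7, 2.3]:
    assert abs(math.cos(x) ** 2 - (0.5 + 0.5 * math.cos(2 * x))) < 1e-12
```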
3. A Redundancy-Aware, Energy-Efficient Graph Processing Accelerator
Authors: Yao Pengcheng, Liao Xiaofei, Jin Hai, Zhou Yuhang, Xu Peng, Zhang Wei, Zeng Zhen, Pan Chengao, Zhu Bing. Scientia Sinica Informationis (CSCD, PKU Core), 2024, No. 6, pp. 1369-1385 (17 pages)
As a data structure that flexibly expresses relationships between objects, graphs are widely used in many important real-world scenarios. In recent years, as performance gains have slowed, general-purpose processors have gradually become unable to meet the needs of graph processing applications and have become the main bottleneck limiting the development of graph computing. Domain-specific accelerators for graph processing have therefore become a research hotspot. Through customized hardware design, graph accelerators can outperform general-purpose processors by tens of times on graph workloads. However, when running breadth-first search, existing graph accelerators repeatedly access the data of power-law (high-degree) vertices, causing severe redundant memory accesses; in certain scenarios their performance even falls below that of a general-purpose CPU. To address this problem, this paper proposes JiFeng, a redundancy-aware, energy-efficient graph accelerator. Once a power-law vertex completes its iterative computation, JiFeng skips its remaining adjacent edges, greatly reducing the number of times it is repeatedly accessed. JiFeng implements a series of hardware-software co-designs that improve hardware execution efficiency while preserving load balance. To validate its effectiveness, this paper evaluates the design on an FPGA prototype system. JiFeng achieves up to 461.2 billion traversed edges per second, with an energy efficiency of 12.5 billion traversed edges per second per watt, on typical synthetic and real-world graphs, and ranked 2nd on the small-dataset list of the GreenGraph500 ranking in November 2023.
Keywords: graph processing; accelerator; breadth-first search; redundant memory access; FPGA
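The redundancy JiFeng eliminates can be seen in a plain software BFS (a toy analogue, not the accelerator's design): once a vertex's level is settled, every further edge pointing at it is wasted work and can be skipped.

```python
from collections import deque

def bfs_levels(adj, src):
    """Level-assigning BFS that skips edges to already-settled vertices,
    a much-simplified analogue of skipping a converged power-law
    vertex's remaining edges. Returns (levels, skipped_edge_count)."""
    level = {src: 0}
    frontier = deque([src])
    skipped = 0
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, []):
            if v in level:          # vertex already settled: skip it
                skipped += 1
                continue
            level[v] = level[u] + 1
            frontier.append(v)
    return level, skipped

adj = {0: [1, 2], 1: [2, 3], 2: [3], 3: [0]}
lv, sk = bfs_levels(adj, 0)
assert lv == {0: 0, 1: 1, 2: 1, 3: 2}
```

In hardware the "skip" saves a memory access to the vertex's data, which is where the energy savings come from; in software it only saves an update.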
4. A hybrid memory architecture supporting fine-grained data migration (Cited: 1)
Authors: Ye Chi, Jianhui Yue, Xiaofei Liao, Haikun Liu, Hai Jin. Frontiers of Computer Science (SCIE, EI, CSCD), 2024, No. 2, pp. 31-41 (11 pages)
Hybrid memory systems composed of dynamic random access memory (DRAM) and non-volatile memory (NVM) often exploit page migration technologies to take full advantage of the different memory media. Most previous proposals migrate data at the granularity of 4 KB pages, and thus waste memory bandwidth and DRAM resources. In this paper, we propose Mocha, a non-hierarchical architecture that organizes DRAM and NVM in a flat physical address space but manages them as a cache/memory hierarchy. Since the commercial NVM device, the Intel Optane DC Persistent Memory Module (DCPMM), actually accesses the physical media at a granularity of 256 bytes (an Optane block), we manage the DRAM cache at a 256-byte size to match this feature of Optane. This design not only enables fine-grained data migration and management for the DRAM cache, but also avoids write amplification on Intel Optane DCPMM. We also create an Indirect Address Cache (IAC) in the Hybrid Memory Controller (HMC) and propose a reverse address mapping table in the DRAM to speed up address translation and cache replacement. Moreover, we exploit a utility-based caching mechanism to filter cold blocks in the NVM and further improve the efficiency of the DRAM cache. We implement Mocha in an architectural simulator. Experimental results show that Mocha improves application performance by 8.2% on average (up to 24.6%), and reduces energy consumption by 6.9% and data migration traffic by 25.9% on average, compared with a typical hybrid memory architecture, HSCC.
Keywords: non-volatile memory; hybrid memory system; data migration; fine-grained caching
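Managing the DRAM cache at Optane's 256-byte block granularity can be modeled with a toy direct-mapped cache (a software sketch of the granularity idea only; Mocha realizes this in the memory controller with its Indirect Address Cache and reverse mapping table):

```python
BLOCK = 256  # Intel Optane DCPMM accesses physical media in 256-byte blocks

class BlockCache:
    """Direct-mapped DRAM cache over NVM, tracked at 256-byte blocks.
    A toy model: one tag per set, fill on miss."""
    def __init__(self, n_sets: int):
        self.n_sets = n_sets
        self.tags = [None] * n_sets   # which NVM block occupies each set

    def access(self, addr: int) -> bool:
        block = addr // BLOCK
        s = block % self.n_sets
        hit = self.tags[s] == block
        if not hit:
            self.tags[s] = block      # fill on miss (evicting the old tag)
        return hit

c = BlockCache(n_sets=4)
assert c.access(0x100) is False       # cold miss fills the set
assert c.access(0x1A0) is True        # same 256-byte block: hit
```

Caching at 256 bytes rather than 4 KB is what lets a miss fill exactly one Optane block, avoiding the write amplification the abstract mentions.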
5. A survey on dynamic graph processing on GPUs: concepts, terminologies and systems
Authors: Hongru Gao, Xiaofei Liao, Zhiyuan Shao, Kexin Li, Jiajie Chen, Hai Jin. Frontiers of Computer Science (SCIE, EI, CSCD), 2024, No. 4, pp. 1-23 (23 pages)
Graphs, which model real-world entities with vertices and the relationships among entities with edges, have proven to be a powerful tool for describing real-world problems. In most real-world scenarios, entities and their relationships are subject to constant change; graphs that record such changes are called dynamic graphs. In recent years, the widespread application of dynamic graphs has stimulated extensive research on dynamic graph processing systems, which continuously ingest graph updates and produce up-to-date graph analytics results. As dynamic graphs grow larger, higher performance is demanded of dynamic graph processing systems. With massive parallel processing power and high memory bandwidth, GPUs have become mainstream vehicles for accelerating dynamic graph processing tasks. GPU-based dynamic graph processing systems mainly address two challenges: maintaining the graph data when updates occur (i.e., graph updating) and producing analytics results in time (i.e., graph computing). In this paper, we survey GPU-based dynamic graph processing systems and review their methods for addressing both graph updating and graph computing. To comprehensively discuss existing systems, we first introduce the terminology of dynamic graph processing and then develop a taxonomy describing the methods employed for graph updating and graph computing. In addition, we discuss the challenges and future research directions of dynamic graph processing on GPUs.
Keywords: dynamic graphs; graph processing; graph algorithms; GPUs
6. A Hardware Acceleration Mechanism for Monotonic Graph Algorithms on Dynamic Directed Graphs (Cited: 1)
Authors: Yang Yun, Yu Hui, Zhao Jin, Zhang Yu, Liao Xiaofei, Jiang Xinyu, Jin Hai, Liu Haikun, Mao Fubing, Zhang Ji, Wang Biao. Scientia Sinica Informationis (CSCD, PKU Core), 2023, No. 8, pp. 1575-1592 (18 pages)
With the rapid growth of real-world demand for dynamic graph computation, existing work has proposed a variety of methods to efficiently support monotonic graph algorithms on dynamic directed graphs. However, because the structure of a dynamic directed graph changes frequently and the state updates of adjacent vertices have complex dependencies, existing software and hardware approaches still suffer from high data-access cost and slow convergence when processing monotonic graph algorithms. To this end, this paper proposes DSGraph, an accelerator for monotonic graph algorithms on dynamic directed graphs. It fully exploits the dependencies between vertices to speed up convergence and effectively reduce data-access cost. Specifically, DSGraph extracts the local topological dependency order of vertices in the dynamic directed graph at runtime to perform asynchronous iterative processing, significantly reducing redundant vertex state updates. DSGraph also designs an asynchronous iterative pipeline architecture that processes vertex states in dependency order, accelerating state propagation and reducing data-access overhead. Finally, DSGraph proposes a non-blocking data synchronization mechanism that reduces system synchronization overhead by updating local vertex states and synchronizing external vertex data in parallel. Experiments show that, compared with KickStarter, the state-of-the-art dynamic graph processing system for monotonic graph algorithms, DSGraph improves dynamic directed graph processing speed by 11.2x on average.
Keywords: dynamic directed graph; monotonic graph algorithm; incremental computation; dependency awareness; graph accelerator
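A monotonic graph algorithm in the sense used here can be illustrated with single-source shortest paths, where vertex state only ever decreases; that monotonicity is what permits asynchronous, dependency-ordered updates (a plain-Python sketch with all hardware aspects omitted):

```python
import heapq

def sssp(adj, src):
    """Monotonic min-propagation: a vertex's distance only decreases,
    so updates can be applied asynchronously in any order and still
    converge; processing in priority order just converges fastest."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale worklist entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

adj = {0: [(1, 4), (2, 1)], 2: [(1, 2)], 1: [(3, 1)]}
assert sssp(adj, 0) == {0: 0, 2: 1, 1: 3, 3: 4}
```

When an edge of the dynamic graph changes, only vertices whose best distance depended on it need reprocessing, which is the opportunity dependency-aware systems exploit.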
7. Exploring the Application of Graph Computing to ATPG
Authors: Mao Fubing, Peng Da, Zhang Yu, Liao Xiaofei, Jiang Xinyu, Yang Yun, Jin Hai, Zhao Jin, Liu Haikun, Wang Liuzheng. Scientia Sinica Informationis (CSCD, PKU Core), 2023, No. 2, pp. 211-233 (23 pages)
ATPG (automatic test pattern generation) is a crucial technique in VLSI (very large scale integration) circuit testing; its quality directly affects test cost and overhead. However, existing parallel ATPG methods commonly suffer from load imbalance, a single parallelization strategy, high storage overhead, and poor data locality. Thanks to the high parallelism and scalability of graph computing, fast, efficient, low-storage, and highly scalable graph processing systems may be an important tool for effectively supporting ATPG, which matters greatly for reducing test cost. This paper explores the application of graph computing to combinational ATPG: it introduces how graph computation models transform ATPG algorithms into graph algorithms, analyzes the challenges existing graph processing systems face when applied to ATPG, proposes a single-machine graph processing system for ATPG, and discusses the challenges and future research directions of supporting ATPG with graph processing systems from the perspectives of optimization on traditional architectures, acceleration with emerging hardware, and optimization based on emerging memory devices.
Keywords: graph computing; very large scale integration; automatic test pattern generation; electronic design automation; circuit testing
8. UCat: heterogeneous memory management for unikernels
Authors: Chong Tian, Haikun Liu, Xiaofei Liao, Hai Jin. Frontiers of Computer Science (SCIE, EI, CSCD), 2023, No. 1, pp. 51-61 (11 pages)
Unikernels provide an efficient and lightweight way to deploy cloud computing services in application-specialized, single-address-space virtual machines (VMs), allowing hundreds of unikernel-based VMs to be deployed on a single physical server. In such a cloud platform, main memory is the primary bottleneck resource for high-density application deployment. Recently, non-volatile memory (NVM) technologies have become increasingly popular in cloud data centers because they offer extremely large memory capacity at low expense. However, many challenges remain in utilizing NVMs for unikernel-based VMs, such as the difficulty of heterogeneous memory allocation and the high performance overhead of address translation. In this paper, we present UCat, a heterogeneous memory management mechanism that supports multi-grained memory allocation for unikernels. We propose front-end/back-end cooperative address space mapping to expose the host memory heterogeneity to unikernels. UCat exploits large pages to reduce the cost of two-layer address translation in virtualized environments, and leverages slab allocation to reduce memory waste due to internal fragmentation. We implement UCat based on the popular unikernel OSv and conduct extensive experiments to evaluate its efficiency. Experimental results show that UCat reduces the memory consumption of unikernels by 50% and the TLB miss rate by 41%, and improves the throughput of real-world benchmarks such as memslap and YCSB by up to 18.5% and 14.8%, respectively.
Keywords: unikernel; virtualization; non-volatile memory; heterogeneous memory; large page; slab allocation
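The slab-allocation idea UCat leverages can be shown generically (a textbook sketch, not UCat's implementation): carve a region, such as a large page, into fixed-size objects and serve allocations from a free list, so small allocations do not fragment the page.

```python
class Slab:
    """Fixed-size-object allocator carved from one contiguous region.
    Generic sketch of slab allocation; UCat applies the idea inside
    unikernels on large pages to cut internal fragmentation."""
    def __init__(self, region_size: int, obj_size: int):
        self.obj_size = obj_size
        # offsets of all free objects in the region
        self.free = list(range(0, region_size - obj_size + 1, obj_size))

    def alloc(self) -> int:
        return self.free.pop()        # offset of a free object

    def free_obj(self, off: int):
        self.free.append(off)         # LIFO reuse keeps caches warm

s = Slab(region_size=4096, obj_size=64)
a, b = s.alloc(), s.alloc()
assert a != b and a % 64 == 0 and b % 64 == 0
s.free_obj(a)
assert s.alloc() == a                 # freed object is reused first
```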
9. ReCSA: a dedicated sort accelerator using ReRAM-based content addressable memory
Authors: Huize Li, Hai Jin, Long Zheng, Yu Huang, Xiaofei Liao. Frontiers of Computer Science (SCIE, EI, CSCD), 2023, No. 2, pp. 1-13 (13 pages)
With the increasing amount of data, there is an urgent need for efficient sorting algorithms to process large data sets. Hardware sorting has attracted much attention because it can exploit the parallelism of different hardware, but traditional hardware sort accelerators suffer from the "memory wall" problem due to multiple rounds of data transmission between memory and the processor. In this paper, we utilize the in-situ processing ability of the ReRAM crossbar to design a new ReCAM array that can perform matrix-vector multiplication and vector-scalar comparison in the same array simultaneously. Using this ReCAM array, we present ReCSA, the first dedicated ReCAM-based sort accelerator. Besides the hardware design, we also develop algorithms that maximize memory utilization and minimize memory exchanges to improve sorting performance. The sorting algorithm in ReCSA can process various data types, such as integers, floats, doubles, and strings. We also present experiments evaluating performance and energy efficiency against state-of-the-art sort accelerators. The experimental results show that ReCSA achieves speedups of 90.92×, 46.13×, 27.38×, 84.57×, and 3.36× over CPU-, GPU-, FPGA-, NDP-, and PIM-based platforms when processing numeric data sets, and performance improvements of 24.82×, 32.94×, and 18.22× over CPU-, GPU-, and FPGA-based platforms when processing string data sets.
Keywords: ReCAM; parallel sorting; architecture design; processing-in-memory
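A ReCAM's key primitive is comparing one value against all stored rows in parallel; a rough software analogue is a selection sort whose inner step is a whole-array minimum search (illustrative only; ReCSA performs these steps in place inside the ReRAM array, without moving data to a processor):

```python
def cam_style_sort(vals):
    """Selection sort driven by a 'compare-all-rows-at-once' primitive,
    a software stand-in for ReCAM's parallel vector-scalar comparison.
    Each min() models one constant-time parallel CAM search step."""
    vals = list(vals)
    out = []
    while vals:
        m = min(vals)                 # one parallel CAM search step
        out.append(m)
        vals.remove(m)
    return out

assert cam_style_sort([3, 1, 2, 1]) == [1, 1, 2, 3]
```

On a CPU each `min` is O(n), so this is quadratic; the accelerator's point is that the same search is a single parallel operation in the array.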
10. Hypermethylated GPR135 gene expression is a favorable independent prognostic factor in nasopharyngeal carcinoma
Authors: Chunqiao Gan, Guanjie Qin, Shufang Liao, Xiaofei Liao, Jinping Xu, Wei Jiang. Holistic Integrative Oncology, 2023, No. 1, pp. 248-255 (8 pages)
Purpose: To investigate the methylation status and expression level of G protein-coupled receptor 135 (GPR135) in nasopharyngeal carcinoma (NPC) and determine its prognostic value. Methods: GPR135 methylation data for NPC and normal nasopharyngeal tissues were obtained from the Gene Expression Omnibus (GEO) GSE52068 dataset. The methylation level of the GPR135 promoter region in four normal nasopharyngeal epithelial tissues and eight NPC tissues was detected by bisulfite sequencing. GPR135 expression in NPC and normal nasopharyngeal tissue was obtained from the GEO GSE13597 dataset. GPR135 mRNA expression levels in 13 NPC and 26 healthy control tissues were assessed with quantitative real-time PCR (qRT-PCR). GPR135 expression in 124 NPC tissue sections was analyzed by immunohistochemistry, and its correlation with clinicopathological features was analyzed with a chi-square test. The influence of immunohistochemically determined GPR135 expression on prognosis was assessed by Kaplan-Meier and Cox regression analyses. Results: Bisulfite sequencing demonstrated that the GPR135 promoter region was highly methylated in NPC tissues. The immunohistochemistry results revealed that patients with high GPR135 expression had better overall survival (hazard ratio [HR]=0.177, 95% confidence interval [95% CI]: 0.072-0.437, P=0.008), disease-free survival (HR=0.4401, 95% CI: 0.222-0.871, P=0.034), and local recurrence-free survival (HR=0.307, 95% CI: 0.119-0.790, P=0.046) than those with low GPR135 expression. Conclusion: GPR135 is hypermethylated in NPC, and high GPR135 expression indicates a favorable prognosis; GPR135 may therefore serve as a prognostic indicator.
Keywords: nasopharyngeal carcinoma; GPR135; prognostic value; methylation; immunohistochemistry
11. An Efficient Incremental Strongly Connected Components Algorithm for Dynamic Directed Graphs (Cited: 6)
Authors: Liao Xiaofei, Chen Yicheng, Zhang Yu, Jin Hai, Liu Haikun, Zhao Jin. Scientia Sinica Informationis (CSCD, PKU Core), 2019, No. 8, pp. 988-1004 (17 pages)
Strongly connected component (SCC) algorithms condense a directed graph into a directed acyclic graph (DAG) and are widely used in directed-graph analyses such as reachability queries. Although existing work has proposed various SCC algorithms for static directed graphs, they incur high runtime overhead because they repeatedly recompute over the entire graph in response to the frequent structural changes that are ubiquitous in real-world dynamic directed graphs. In practice, each change to a dynamic directed graph is usually tiny (less than 5%), which allows SCCs to be computed incrementally to shorten response time. This paper therefore proposes Inc-SCC, an efficient incremental strongly connected components algorithm for dynamic directed graphs. It prunes unnecessary computation to reduce data accesses and computation, and exploits the disjointness of SCCs to process them in parallel and improve efficiency. It further proposes a heuristic optimization to accelerate convergence. Experimental results show that the method can respond to continuous dynamic changes in real time; when 5% of the graph's edges change, its speedup over existing algorithms reaches 2.8x to 12x, and when 0.5% of the edges change, the speedup reaches 2.9x to 12x.
Keywords: strongly connected component; dynamic directed graph; incremental computation; convergence; directed acyclic graph
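For reference, the full (non-incremental) computation that Inc-SCC avoids re-running over the whole graph is classic SCC detection, e.g. Tarjan's algorithm:

```python
def tarjan_scc(adj, n):
    """Tarjan's strongly connected components (full recomputation).
    An incremental algorithm updates only the SCCs affected by edge
    insertions/deletions instead of rerunning this from scratch."""
    index = [None] * n
    low = [0] * n
    on_stack = [False] * n
    stack, sccs, counter = [], [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack[v] = True
        for w in adj.get(v, []):
            if index[w] is None:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif on_stack[w]:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:        # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop(); on_stack[w] = False
                comp.append(w)
                if w == v:
                    break
            sccs.append(sorted(comp))

    for v in range(n):
        if index[v] is None:
            strongconnect(v)
    return sccs

# 0->1->2->0 forms one SCC; 3 is its own SCC
assert sorted(tarjan_scc({0: [1], 1: [2], 2: [0, 3]}, 4)) == [[0, 1, 2], [3]]
```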
12. An Efficient Storage System for Highly Concurrent Graph Analysis Tasks (Cited: 3)
Authors: Zhao Jin, Jiang Xinyu, Zhang Yu, Liao Xiaofei, Jin Hai, Liu Haikun, Yang Yun, Zhang Ji, Wang Biao, Yu Ting. Scientia Sinica Informationis (CSCD, PKU Core), 2022, No. 1, pp. 111-128 (18 pages)
With the rapid growth of real-world graph computing demand, large numbers of iterative graph analysis tasks often run concurrently on the same platform. However, existing graph processing systems are mainly designed to execute a single graph analysis task efficiently; when multiple concurrent tasks execute in parallel over the same underlying graph, these systems face enormous data-access overhead. To improve the throughput of concurrent graph analysis tasks, existing out-of-core concurrent graph processing schemes share graph data to reduce storage and data-access costs. However, because of the power-law degree distribution of real-world graphs and the differences between graph analysis tasks, existing schemes still incur substantial unnecessary redundant I/O: even when most vertices in a static graph partition are inactive or shared by only a few tasks, the entire partition is still loaded into memory for processing. To solve this problem, this paper proposes GraphDP, an efficient storage system for concurrent graph analysis tasks. It can be plugged into existing out-of-core graph processing systems to transparently and effectively reduce their storage consumption and data-access overhead when executing concurrent tasks, thereby improving throughput. Specifically, GraphDP uses a novel dynamic I/O scheduling strategy that loads graph data in the optimal I/O access manner and effectively reduces the data loaded into memory and cache. Meanwhile, GraphDP preferentially caches frequently accessed graph data in memory through an efficient caching mechanism, further reducing data-access overhead. To demonstrate its effectiveness, we plugged GraphDP into popular out-of-core graph processing systems, including GridGraph, GraphChi, and X-Stream. Experimental results show that GraphDP improves the throughput of GridGraph, GraphChi, and X-Stream by 1.57-2.19x, 1.86-2.37x, and 1.62-2.21x, respectively.
Keywords: iterative graph processing; concurrent tasks; storage system; I/O overhead; throughput
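The caching mechanism's effect can be sketched with a toy frequency-based partition cache (our simplification, not GraphDP's actual policy): partitions requested often by concurrent tasks stay resident, cutting redundant partition loads from disk.

```python
from collections import Counter

class PartitionCache:
    """Keep the partitions most demanded by concurrent tasks in memory.
    A toy frequency-based stand-in for a shared-graph caching layer."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.freq = Counter()
        self.resident = set()
        self.loads = 0              # I/O loads actually performed

    def request(self, partition: int):
        self.freq[partition] += 1
        if partition not in self.resident:
            self.loads += 1         # would trigger disk I/O
            self.resident.add(partition)
            if len(self.resident) > self.capacity:
                # evict the least frequently requested resident partition
                victim = min(self.resident, key=lambda p: self.freq[p])
                self.resident.discard(victim)

cache = PartitionCache(capacity=1)
for p in [0, 0, 0, 1, 0, 0]:        # partition 0 is hot across tasks
    cache.request(p)
assert 0 in cache.resident
```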
13. Maximal Clique Enumeration on Graph Data: State of the Art and Challenges (Cited: 2)
Authors: Xu Shaoxian, Liao Xiaofei, Shao Zhiyuan, Hua Qiangsheng, Jin Hai. Scientia Sinica Informationis (CSCD, PKU Core), 2022, No. 5, pp. 784-803 (20 pages)
With the arrival of the big-data era, graph data mining has become a popular research direction. Maximal clique enumeration (MCE), a fundamental problem in graph theory, has wide applications in many fields. However, given the inherent complexity of the problem and the rapid growth of real-world graphs, enumerating maximal cliques on real graph data is time-consuming. A large body of work has improved MCE algorithms or applied various computational optimizations to reduce their running time. This paper classifies and summarizes existing research on maximal clique enumeration, reviews the state of the art in detail, and discusses the challenges and future directions for the further development of the problem.
Keywords: maximal clique enumeration; graph theory; graph data mining; graph partitioning; parallel computing
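The baseline that most MCE work builds on is the Bron-Kerbosch algorithm; a minimal version (without the pivoting and ordering optimizations the survey covers) looks like this:

```python
def bron_kerbosch(R, P, X, adj, out):
    """Classic Bron-Kerbosch maximal clique enumeration (no pivoting).
    R: current clique, P: candidates, X: already-processed vertices."""
    if not P and not X:
        out.append(sorted(R))       # R is a maximal clique
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
        P.remove(v)
        X.add(v)

# triangle 0-1-2 plus a pendant edge 2-3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
assert sorted(cliques) == [[0, 1, 2], [2, 3]]
```

The X set is what prevents the same maximal clique from being reported twice; the exponential worst case of this recursion is exactly why the surveyed optimizations matter at scale.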
14. An effective framework for asynchronous incremental graph processing (Cited: 5)
Authors: Xinqiao Lv, Wei Xiao, Yu Zhang, Xiaofei Liao, Hai Jin, Qiangsheng Hua. Frontiers of Computer Science (SCIE, EI, CSCD), 2019, No. 3, pp. 539-551 (13 pages)
Although many graph processing systems have been proposed, real-world graphs are often dynamic, and it is important to keep the results of graph computation up-to-date. Incremental computation has been demonstrated to be an efficient way to update calculated results. Recently, many incremental graph processing systems have been proposed that handle dynamic graphs asynchronously and achieve better performance than synchronous approaches. However, these solutions still suffer from sub-optimal convergence speed due to slow propagation of important vertex state (important to convergence speed) and poor locality. To solve these problems, we propose a novel graph processing framework. It introduces a dynamic partition method that gathers the important vertices for high locality, and then uses a priority-based scheduling algorithm to assign them higher priority for an effective processing order. By these means, it reduces the number of updates and increases locality, thereby reducing convergence time. Experimental results show that our method reduces the number of updates by 30% and the total execution time by 35% compared with state-of-the-art systems.
Keywords: incremental computation; graph processing; iterative computation; asynchronous; convergence
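Priority-based scheduling of important vertex state can be sketched with residual ("delta") PageRank, processing the largest pending residual first (a software illustration of the idea, not the paper's framework):

```python
def delta_pagerank(adj, n, d=0.85, eps=1e-8):
    """Asynchronous residual PageRank: repeatedly pick the vertex with
    the largest pending residual and push it to its out-neighbors.
    Largest-first order propagates important state soonest, so fewer
    total updates are needed than with arbitrary ordering."""
    rank = [0.0] * n
    res = [1 - d] * n                 # pending residual per vertex
    while True:
        u = max(range(n), key=lambda v: res[v])
        if res[u] <= eps:
            break                     # everything left is negligible
        r, res[u] = res[u], 0.0
        rank[u] += r
        outs = adj.get(u, [])
        for v in outs:
            res[v] += d * r / len(outs)
    return rank

ranks = delta_pagerank({0: [1], 1: [0]}, 2)
assert abs(ranks[0] - ranks[1]) < 1e-4   # symmetric two-cycle
```

The linear scan in `max` keeps the sketch short; a real scheduler would use a priority queue or bucketing.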
15. Resource abstraction and data placement for distributed hybrid memory pool (Cited: 1)
Authors: Tingting Chen, Haikun Liu, Xiaofei Liao, Hai Jin. Frontiers of Computer Science (SCIE, EI, CSCD), 2021, No. 3, pp. 47-57 (11 pages)
Emerging byte-addressable non-volatile memory (NVM) technologies offer higher density and lower cost than DRAM, at the expense of lower performance and limited write endurance. There have been many studies on hybrid NVM/DRAM memory management in a single physical server, but how to manage hybrid memories efficiently in a distributed environment remains an open problem. This paper proposes Alloy, a memory resource abstraction and data placement strategy for an RDMA-enabled distributed hybrid memory pool (DHMP). Alloy provides simple APIs for applications to utilize DRAM or NVM resources in the DHMP without being aware of the hardware details of the DHMP. We propose a hotness-aware data placement scheme that combines hot-data migration, data replication, and write merging to improve application performance and reduce the cost of DRAM. We evaluate Alloy with several micro-benchmarks and public benchmark workloads. Experimental results show that Alloy can reduce DRAM usage in the DHMP by up to 95%, while reducing total memory access time by up to 57% compared with state-of-the-art approaches.
Keywords: load balance; distributed hybrid memory; clouds
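The core of hotness-aware placement is a hot/cold split over access counts (a toy sketch; Alloy adds migration, replication, and write merging on top, and the function below is ours, not Alloy's API):

```python
def place(page_access_counts, dram_capacity):
    """Hotness-aware placement: the most frequently accessed pages go
    to scarce DRAM, the rest stay in cheap, dense NVM."""
    ranked = sorted(page_access_counts,
                    key=page_access_counts.get, reverse=True)
    dram = set(ranked[:dram_capacity])
    nvm = set(ranked[dram_capacity:])
    return dram, nvm

dram, nvm = place({"a": 90, "b": 5, "c": 40}, dram_capacity=1)
assert dram == {"a"} and nvm == {"b", "c"}
```

In a running system the counts decay over time and placement is recomputed periodically, which is what turns this static split into migration.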
16. Writeback throttling in a virtualized system with SCM (Cited: 1)
Authors: Dingding Li, Xiaofei Liao, Hai Jin, Yong Tang, Gansen Zhao. Frontiers of Computer Science (SCIE, EI, CSCD), 2016, No. 1, pp. 82-95 (14 pages)
Storage class memory (SCM) has the potential to revolutionize the memory landscape with its non-volatile and byte-addressable properties, yet little published work explores its use in modern virtualized cloud infrastructure. We propose SCM-vWrite, a novel architecture designed around SCM, to ease the performance interference of the virtualized storage subsystem. Through a case study on a typical virtualized cloud system, we first describe why current writeback approaches are not suitable for a virtualized environment, then design and implement SCM-vWrite to address the problem. We also use typical benchmarks and realistic workloads to evaluate its performance. Compared with the traditional method on a conventional architecture, experimental results show that SCM-vWrite coordinates writeback flows more effectively among multiple co-located guest operating systems, achieving better disk I/O performance without any loss of reliability.
Keywords: virtualization; storage class memory; writeback
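Coordinated writeback throttling can be sketched with a token bucket that bounds how many dirty pages a flow may flush per interval (a generic illustration of throttling, not SCM-vWrite's mechanism):

```python
class WritebackThrottle:
    """Token-bucket throttling of a writeback flow: each tick grants
    a budget of pages; excess dirty pages wait for future tokens.
    One instance per guest would cap each VM's share of disk I/O."""
    def __init__(self, rate_pages_per_tick: int):
        self.rate = rate_pages_per_tick
        self.tokens = 0

    def tick(self):
        self.tokens += self.rate      # replenish budget each interval

    def writeback(self, dirty_pages: int) -> int:
        done = min(dirty_pages, self.tokens)
        self.tokens -= done
        return done                   # pages actually flushed now

t = WritebackThrottle(rate_pages_per_tick=10)
t.tick()
assert t.writeback(25) == 10          # the other 15 pages must wait
```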
17. FunctionFlow: coordinating parallel tasks
Authors: Xuepeng Fan, Xiaofei Liao, Hai Jin. Frontiers of Computer Science (SCIE, EI, CSCD), 2019, No. 1, pp. 73-85 (13 pages)
With the growing popularity of task-based parallel programming, today's task-parallel programming libraries and languages still provide limited support for coordinating parallel tasks. This limitation forces programmers to use additional independent components to coordinate parallel tasks, either third-party libraries or additional components in the same programming library or language. Moreover, mixing tasks and coordination components increases the difficulty of task-based programming and blinds schedulers to tasks' dependencies. In this paper, we propose a task-based parallel programming library, FunctionFlow, which coordinates tasks so as to avoid additional independent coordination components. First, we use dependency expressions to represent ubiquitous task termination. The key idea behind dependency expressions is to use && for the termination of all tasks and || for the termination of any task, along with the combination of dependency expressions. Second, as runtime support, we use a lightweight representation for dependency expressions, and a suspended-task queue to schedule tasks that still have prerequisites to run. Finally, we demonstrate FunctionFlow's effectiveness in two aspects: case studies implementing popular parallel patterns with FunctionFlow, and a performance comparison with the state-of-the-art practice, TBB. Our demonstration shows that FunctionFlow can generally coordinate parallel tasks without involving additional components, with performance comparable to TBB.
Keywords: task-parallel programming; task dependency; FunctionFlow; coordination; patterns
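FunctionFlow's dependency expressions (&& for all tasks' termination, || for any task's termination) map naturally onto future combinators. The sketch below uses Python futures; the names all_of/any_of are our stand-ins, not FunctionFlow's C++ API.

```python
from concurrent.futures import (ThreadPoolExecutor, wait,
                                FIRST_COMPLETED, ALL_COMPLETED)

def all_of(futures):
    """'&&' in a dependency expression: block until all terminate."""
    wait(futures, return_when=ALL_COMPLETED)

def any_of(futures):
    """'||' in a dependency expression: block until any terminates."""
    wait(futures, return_when=FIRST_COMPLETED)

with ThreadPoolExecutor() as pool:
    a = pool.submit(lambda: 1)
    b = pool.submit(lambda: 2)
    all_of([a, b])                      # a && b must finish first
    c = pool.submit(lambda: a.result() + b.result())
    assert c.result() == 3
```

Expressing the dependency directly (rather than via a separate synchronization component) is exactly what lets a scheduler see which tasks are still blocked.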
18. Superpage-Friendly Page Table Design for Hybrid Memory Systems
Authors: Xiaoyuan Wang, Haikun Liu, Xiaofei Liao, Hai Jin. 《国际计算机前沿大会会议论文集》 (Proceedings of the International Computer Frontiers Conference), 2020, No. 1, pp. 623-641 (19 pages)
Page migration has long been adopted in hybrid memory systems comprising dynamic random access memory (DRAM) and non-volatile memories (NVMs) to improve system performance and energy efficiency. However, page migration introduces side effects, such as more translation lookaside buffer (TLB) misses, broken memory contiguity, and extra memory accesses due to page table updates. In this paper, we propose a superpage-friendly page table called SuperPT to reduce the performance overhead of serving TLB misses. By leveraging a virtual hashed page table and a hybrid DRAM allocator, SuperPT performs address translations in a flexible and efficient way while preserving the contiguity of the migrated pages.
Keywords: page table; hybrid memory system; page migration; multiple page sizes; address translation
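A hashed page table resolves a virtual page number by hashing into buckets rather than walking a multi-level radix tree; a toy model (ours, not SuperPT's superpage-aware layout) shows the lookup structure:

```python
class HashedPageTable:
    """Open-hashing page table: virtual page number (VPN) -> frame.
    One hash probe replaces a multi-level page-table walk, which is
    the property a virtual hashed page table exploits on TLB misses."""
    def __init__(self, n_buckets: int = 16):
        self.buckets = [[] for _ in range(n_buckets)]

    def map(self, vpn: int, frame: int):
        self.buckets[vpn % len(self.buckets)].append((vpn, frame))

    def translate(self, vpn: int):
        for v, f in self.buckets[vpn % len(self.buckets)]:
            if v == vpn:
                return f
        return None                   # unmapped: page fault

pt = HashedPageTable()
pt.map(0x1234, 7)
assert pt.translate(0x1234) == 7
assert pt.translate(0x9999) is None
```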