20 articles found
A multilevel preconditioner and its shared memory implementation for a new generation reservoir simulator (Cited by: 2)
1
Authors: Wu Shuhong, Xu Jinchao, Feng Chunsheng, Zhang Chen-Song, Li Qiaoyun, Shu Shi, Wang Baohua, Li Xiaobo, Li Hua. Source: Petroleum Science (SCIE, CAS, CSCD), 2014, Issue 4, pp. 540-549 (10 pages)
As a result of the interplay between advances in computer hardware, software, and algorithms, we are now in a new era of large-scale reservoir simulation, which focuses on accurate flow description, fine reservoir characterization, efficient nonlinear/linear solvers, and parallel implementation. In this paper, we discuss a multilevel preconditioner in a new-generation simulator and its implementation on multicore computers. This preconditioner relies on the method of subspace corrections to solve large-scale linear systems arising from fully implicit methods in reservoir simulations. We investigate the parallel efficiency and robustness of the proposed method by applying it to million-cell benchmark problems.
Keywords: MULTILEVEL PRECONDITIONER shared memory large-scale linear system reservoir simulation
Design of efficient parallel algorithms on shared memory multiprocessors
2
Author: Qiao Xiangzhen (Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, P. R. China). Source: Wuhan University Journal of Natural Sciences (CAS), 1996, Issue Z1, pp. 344-349 (6 pages)
The design of parallel algorithms is studied in this paper. These algorithms are applicable to shared memory MIMD machines. The emphasis is put on methods for designing efficient parallel algorithms. The design of efficient parallel algorithms should be based on the following considerations: algorithm parallelism and hardware parallelism; granularity of the parallel algorithm; and algorithm optimization according to the underlying parallel machine. In this paper, these principles are applied to solve a model PDE problem. The speedup of the new method is high. The results were tested and evaluated on a shared memory MIMD machine. The practical results agree with the predicted performance.
Keywords: parallel algorithm shared memory multiprocessor parallel granularity optimization
PERFORMANCE COMPARISON OF CELL-BASED AND PACKET-BASED SWITCHING SCHEMES FOR SHARED MEMORY SWITCHES
3
Authors: Xi Kang, Ge Ning, Feng Chongxi. Source: Journal of Electronics (China), 2004, Issue 1, pp. 55-63 (9 pages)
Shared Memory (SM) switches are widely used for their high throughput, low delay and efficient use of memory. This paper compares the performance of two prominent switching schemes of SM packet switches: Cell-Based Switching (CBS) and Packet-Based Switching (PBS). Theoretical analysis is carried out to draw qualitative conclusions on the memory requirement, throughput and packet delay of the two schemes. Furthermore, simulations are carried out to get quantitative results of the performance comparison under various system loads, traffic patterns, and memory sizes. Simulation results show that PBS has the advantage of shorter time delay, while CBS has a lower memory requirement and achieves higher throughput when the memory size is limited. The comparison can be used for the tradeoff between performance and complexity in switch design.
Keywords: shared memory switch Packet switching Cell switching THROUGHPUT Packet delay
Pragma Directed Shared Memory Centric Optimizations on GPUs (Cited by: 1)
4
Authors: Jing Li (CCF), Lei Liu, Yuan Wu, Xiang-Hua Liu, Yi Gao, Xiao-Bing Feng, Cheng-Yong Wu. Source: Journal of Computer Science & Technology (SCIE, EI, CSCD), 2016, Issue 2, pp. 235-252 (18 pages)
GPUs have become a ubiquitous choice as coprocessors since they have excellent ability in concurrent processing. In GPU architecture, shared memory plays a very important role in system performance as it can largely improve bandwidth utilization and accelerate memory operations. However, even for affine GPU applications that contain regular access patterns, optimizing for shared memory is not easy. It often requires programmer expertise and nontrivial parameter selection. Improper shared memory usage might even underutilize GPU resources. Even with state-of-the-art high-level programming models (e.g., OpenACC and OpenHMPP), it is still hard to utilize shared memory since they lack inherent support for describing shared memory optimization and selecting suitable parameters, let alone maintaining high resource utilization. Targeting higher productivity for affine applications, we propose a data-centric approach to shared memory optimization on GPUs. We design a pragma extension on OpenACC to convey programmers' data management hints to the compiler. Meanwhile, we devise a compiler framework to automatically select optimal parameters for shared arrays, using the polyhedral model. We further propose optimization techniques to expose higher memory and instruction level parallelism. The experimental results show that our shared memory centric approaches effectively improve the performance of five typical GPU applications across four widely used platforms by 3.7x on average, and do not burden programmers with lots of pragmas.
Keywords: GPU shared memory pragma directed data centric
PsmArena: Partitioned Shared Memory for NUMA-Awareness in Multithreaded Scientific Applications
5
Authors: Zhang Yang, Aiqing Zhang, Zeyao Mo. Source: Tsinghua Science and Technology (SCIE, EI, CAS, CSCD), 2021, Issue 3, pp. 287-295 (9 pages)
The Distributed Shared Memory (DSM) architecture is widely used in today's computer design to mitigate the ever-widening processing-memory gap, and it inevitably exhibits Non-Uniform Memory Access (NUMA) to shared-memory parallel applications. Failure to adapt to the NUMA effect can significantly degrade application performance, especially on today's manycore platforms with tens to hundreds of cores. However, traditional approaches such as first-touch and memory policy fall short in false page-sharing, fragmentation, or ease of use. In this paper, we propose a partitioned shared-memory approach that allows multithreaded applications to achieve full NUMA-awareness with only minor code changes, and develop an accompanying NUMA-aware heap manager which eliminates false page-sharing and minimizes fragmentation. Experiments on a 256-core cc-NUMA computing node show that the proposed approach helps applications to adapt to NUMA with only minor code changes and improves the performance of typical multithreaded scientific applications by up to 4.3 times with the increased use of cores.
Keywords: partitioned shared memory Non-Uniform Memory Access (NUMA) heap manager multithread manycore
Optimizing a Parallel Video Encoder with Message Passing and a Shared Memory Architecture
6
Authors: 谷俊丽, 孙义和. Source: Tsinghua Science and Technology (SCIE, EI, CAS), 2011, Issue 4, pp. 393-398 (6 pages)
Implementing video applications on emerging multi-core processors is a promising technique for personal, real-time multi-media applications. However, when porting the legacy parallel video encoders developed for clusters to shared-memory multi-cores, the existing parallel algorithms result in workload imbalances on different cores and communication inefficiencies. This paper describes a strip-wise parallel scheme to balance workloads and a hybrid communication mechanism to reduce communication overhead. The implementation of the H.264 parallel encoder on an eight-CPU Intel Xeon system achieves a 5x to 6x speed-up over a single-thread encoder and a 29% performance improvement over the commonly used master-slave schemes on clusters. The paper also gives further analysis of scalability, parallel efficiency, workload balance, and communication overhead as the number of cores varies.
Keywords: parallel video encoder speed improvement message passing shared memory
The Model of Asynchronous Parallel Nonlinear Multisplitting Method on Shared Memory System
7
Authors: Yang Cao, Qingyang Li (Dept. of Applied Mathematics, Tsinghua University, Beijing 100084, P.R. of China). Source: Wuhan University Journal of Natural Sciences (CAS), 1996, Issue Z1, pp. 483-489 (7 pages)
Nonlinear multisplitting methods are known as parallel iterative methods for solving a large-scale system of nonlinear equations F(x) = 0. We extend the idea of nonlinear multisplitting and consider a new model in which the iteration is executed asynchronously: each processor calculates the solution of an individual nonlinear system belonging to its nonlinear multisplitting and can update the global approximation residing in the shared memory at any time. A local convergence analysis of this model is presented. Finally, we give a numerical example which shows a 'strange' property: speedup Sp > p and efficiency Ep > 1.
Keywords: Asynchronous Parallel Nonlinear Multisplitting Method shared memory processors Efficiency Speedup
PERFORMANCE ANALYSIS OF MULTICAST REPLICATION MECHANISM IN SHARED-MEMORY SWITCH WITH SPEEDUP
8
Authors: Wang Weizhang, Ge Ning, Feng Chongxi. Source: Journal of Electronics (China), 2004, Issue 3, pp. 198-205 (8 pages)
A multicast replication algorithm is proposed for shared memory switches. It uses a dedicated FIFO to multicast by replicating cells at the receiver, and the FIFO operates in parallel with the shared memory. Speedup is used to improve loss and delay performance. A new queueing analytical model is developed based on a sub-timeslot approach. The system performance in terms of cell loss and delay is analyzed and verified by simulation.
Keywords: SWITCH shared memory switch MULTICAST Cell loss
DESIGN AND IMPLEMENTATION OF SINGLE-BUFFERED ROUTERS
9
Authors: Hu Ximing, Qu Jing, Wang Binqiang, Wu Jiangxing. Source: Journal of Electronics (China), 2007, Issue 4, pp. 470-476 (7 pages)
A Single-Buffered (SB) router is a router where only one stage of shared buffering is sandwiched between two interconnects, in contrast to a Combined Input and Output Queued (CIOQ) router, where a central switch fabric is sandwiched between two stages of buffering. The notion of SB routers was first proposed by the High-Performance Networking Group (HPNG) of Stanford University, along with two promising designs of SB routers: the Parallel Shared Memory (PSM) router and the Distributed Shared Memory (DSM) router. The work of HPNG deserves full credit, but all the results they presented appear to rely on a Centralized Memory Management Algorithm (CMMA), which is essentially impractical because of its high processing and communication complexity. This paper attempts to make a scalable high-speed SB router completely practical by introducing a fully distributed architecture for managing the shared memory of SB routers. The resulting SB router is called a Virtual Output and Input Queued (VOIQ) router. The VOIQ scheme not only eliminates the need for the CMMA scheduler, thus allowing a fully distributed implementation with low processing and communication complexity, but also provides QoS guarantees and efficiently supports variable-length packets. In particular, the results of performance testing and the hardware implementation of our VOIQ-based router (NDSC SR1880-T™ series) are illustrated at the end of this paper. The proposal of this paper is the first distributed scheme for designing and implementing SB routers published to date.
Keywords: Single-Buffered (SB) router Distributed shared memory (DSM) Parallel shared memory (PSM) Virtual Output and Input Queued (VOIQ) NDSC SR1880-T™ router
Development of Ubiquitous Simulation Service Structure Based on High Performance Computing Technologies (Cited by: 2)
10
Authors: Sang-Hyun CHO, Jeong-Kil CHOI. Source: Journal of Materials Science & Technology (SCIE, EI, CAS, CSCD), 2008, Issue 3, pp. 374-378 (5 pages)
Simulation has become essential in designing or developing new casting products and in improving manufacturing processes within limited time, because it helps us simulate the nature of processing so that developers can make ideal casting designs. Many development groups worldwide are making every effort to gain an early lead in the commercial simulation market. They have already reported success stories in manufacturing fields by developing and providing high performance, multipurpose simulation technologies. However, these all run on powerful desk-side computers operated mainly by well-trained experts, so it is hard to spread the scientific design concept to newcomers in the casting field. To overcome upcoming problems in scientific casting design, we utilized information technologies and mature hardware infrastructure to spread an effective and scientific casting design mindset, and integrated them all into a Simulation Portal on the web. It offers scientific casting design over the Internet, including ubiquitous access represented by the "Anyone, Anytime, Anywhere" concept for casting designs.
Keywords: Parallel computation Message passing interface (MPI) shared memory processing (SMP) CLUSTERING UBIQUITOUS
Design of Timing Synchronization Software on EAST-NBI
11
Authors: 赵远哲, 胡纯栋, 盛鹏, 张小丹. Source: Plasma Science and Technology (SCIE, EI, CAS, CSCD), 2013, Issue 12, pp. 1237-1240 (4 pages)
To ensure the uniqueness and recognition of data and make it easy to analyze and process the data of all subsystems of the neutral beam injector (NBI), it is required that all subsystems have a unified system time. In this paper, timing synchronization software is presented which draws on several technologies, such as shared memory, multithreading and the TCP protocol. Shared memory helps the server save the information of clients and the system time, and multithreading handles different clients with different threads. The server runs under the Linux operating system, while the client runs under both Linux and Windows. With the help of this design, synchronization of all subsystems can be achieved in less than one second; this accuracy is sufficient for the NBI system, and the reliability of data is thus ensured.
Keywords: EAST NBI timing synchronization shared memory MULTITHREADING SERVER/CLIENT
Out-of-Order Execution in Sequentially Consistent Shared-Memory Systems: Theory and Experiments
12
Authors: 胡伟武 (water.chpc.ict.ac.cn), 夏培肃 (water.chpc.ict.ac.cn). Source: Journal of Computer Science & Technology (SCIE, EI, CSCD), 1998, Issue 2, pp. 125-140 (16 pages)
Traditional implementation of sequential consistency in shared-memory systems requires memory accesses to be globally performed in program order. Based on an event ordering model for correct executions in shared-memory systems, this paper proposes and proves that out-of-order execution does not influence the correctness of an execution provided that a certain condition is met. Simulation results show that the out-of-order execution proposed in this paper is an effective way to improve the performance of a sequentially consistent shared-memory system.
Keywords: shared memory sequential consistency event ordering write atomic out-of-order execution simulation
A Shared Buffer Memory ATM Access Switch (Cited by: 1)
13
Authors: Yu Hao, Zhu Xinning. Source: The Journal of China Universities of Posts and Telecommunications (EI, CSCD), 1998, Issue 1, pp. 34-38, 43 (6 pages)
This paper proposes a Shared Buffer Memory ATM Access Switch. Such switches have significant benefits over crossbar or bus-based switches because their output buffer memories are shared by all the switch output ports and are allotted to a particular output port as the occasion demands. The buffer allocation scheme in the ATM switch is Partial Sharing, which is a trade-off between Complete Sharing and Dedicated Allocation. In addition, the queuing structures used in the shared memory are independent of both the data path through the switch and the cell scheduling mechanism. The method for queue management is simple and effective.
Keywords: ATM switch shared buffer memory partial sharing queue management
Adapting Memory Hierarchies for Emerging Datacenter Interconnects (Cited by: 1)
14
Authors: 江涛, 侯锐, 董建波, 柴琳, Sally A. McKee, 田斌, 张立新, 孙凝晖. Source: Journal of Computer Science & Technology (SCIE, EI, CSCD), 2015, Issue 1, pp. 97-109 (13 pages)
Efficient resource utilization requires that emerging datacenter interconnects support both high performance communication and efficient remote resource sharing. These goals require that the network be more tightly coupled with the CPU chips. Designing a new interconnection technology thus requires considering not only the interconnection itself, but also the design of the processors that will rely on it. In this paper, we study memory hierarchy implications for the design of high-speed datacenter interconnects, particularly as they affect remote memory access, and we use PCIe as the vehicle for our investigations. To that end, we build three complementary platforms: a PCIe-interconnected prototype server with which we measure and analyze current bottlenecks; a software simulator that lets us model microarchitectural and cache hierarchy changes; and an FPGA prototype system with a streamlined, switchless, customized protocol, Thunder, with which we study hardware optimizations outside the processor. We highlight several architectural modifications to better support remote memory access and communication, and quantify their impact and limitations.
Keywords: high-speed interconnect memory hierarchy time shared memory datacenter network
Performance of Text-Independent Automatic Speaker Recognition on a Multicore System
15
Authors: Rand Kouatly, Talha Ali Khan. Source: Tsinghua Science and Technology (SCIE, EI, CAS, CSCD), 2024, Issue 2, pp. 447-456 (10 pages)
This paper studies a high-speed text-independent Automatic Speaker Recognition (ASR) algorithm based on a multicore system's Gaussian Mixture Model (GMM). The high speed is achieved using parallel implementation of the feature extraction and aggregation methods during training and testing procedures. Shared memory parallel programming techniques using both the OpenMP and PThreads libraries are developed to accelerate the code and improve the performance of the ASR algorithm. The experimental results show speed-up improvements of around 3.2 on a personal laptop with an Intel i5-6300HQ (2.3 GHz, four cores without hyper-threading, and 8 GB of RAM). In addition, a remarkable 100% speaker recognition accuracy is achieved.
Keywords: Automatic Speaker Recognition (ASR) Gaussian Mixture Model (GMM) shared memory parallel programming PThreads OPENMP
Efficient Handling of Lock Hand-off in DSM Multiprocessors with Buffering Coherence Controllers (Cited by: 1)
16
Authors: Benjamín Sahelices, Agustín de Dios, Pablo Ibáñez, Víctor Viñals-Yúfera, José María Llabería. Source: Journal of Computer Science & Technology (SCIE, EI, CSCD), 2012, Issue 1, pp. 75-91 (17 pages)
Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks and a lot of time is spent on the competition arising at the lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrives at the coherence controller. During lock hand-off, only the requests from the winning processor contribute to progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller keeping the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. This mechanism requires neither compiler or programmer support nor ISA or coherence protocol changes. By simulating a 32-processor system, we show that using request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in those having a large amount of synchronization activity (the remaining four), we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 71%, respectively.
We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.
Keywords: distributed shared memory multiprocessors synchronization buffer coherence controller request bypass
Evaluation of Remote-I/O Support for a DSM-Based Computation Offloading Scheme
17
Authors: Yuhun Jun, Jaemin Lee, Euiseong Seo. Source: Journal of Computer Science & Technology (SCIE, EI, CSCD), 2017, Issue 5, pp. 957-973 (17 pages)
Computation offloading enables mobile devices to execute rich applications by using the abundant computing resources of powerful server systems. The distributed shared memory based (DSM-based) computation offloading approach is expected to be especially popular in the near future because it can dynamically migrate running threads to computing nodes and does not require any modifications of existing applications to do so. The current DSM-based computation offloading scheme, however, has focused on efficiently offloading computationally intensive applications and has not considered the significant performance degradation caused by processing the I/O requests issued by offloaded threads. Because most mobile applications are interactive and thus yield frequent I/O requests, efficient handling of I/O operations is critically important. In this paper, we quantitatively analyze the performance degradation caused by I/O processing in DSM-based computation offloading schemes using representative commodity applications. To remedy the performance degradation, we apply a remote I/O scheme based on remote device support to computation offloading. The proposed approach improves the execution time by up to 43.6% and saves up to 17.7% of energy consumption in comparison with the existing offloading schemes. Selective compression of the remote I/O scheme reduces the network traffic by up to 53.5%.
Keywords: computation offloading mobile-cloud computing distributed shared memory (DSM) mobile computing
NONH: A New Cache-Based Coherence Protocol for Linked List Structure DSM System and Its Performance Evaluation
18
Authors: 房至一, 鞠九滨. Source: Journal of Computer Science & Technology (SCIE, EI, CSCD), 1996, Issue 4, pp. 405-415 (11 pages)
The management of memory coherence is an important problem in distributed shared memory (DSM) systems. In a cache-based coherence DSM system using a linked list structure, the key to maintaining coherence and improving system performance is how to manage the owner in the linked list. This paper presents the design of a new management protocol, NONH (New-Owner New-Head), and its performance evaluation. The analysis results show that this protocol can improve the scalability and performance of a coherent DSM system using a linked list. It is also suitable for managing cache coherency in a tree-like hierarchical architecture.
Keywords: Linked list cache coherence distributed shared memory
Nonlinear Asynchronous Block Iterative Method
19
Authors: 李庆扬, 曹阳, 田肇云. Source: Tsinghua Science and Technology (EI, CAS), 1996, Issue 3, pp. 73-77 (5 pages)
This paper proposes a class of asynchronous block iterative methods for solving large scale nonlinear equations F(x)=0 and proves local convergence. The method splits F into p blocks, then performs the asynchronous parallel iteration on a shared memory multiprocessor with p processors. Because each processor need only solve equations of low dimension and there is no synchronous waiting time, the parallel efficiency can be increased. Finally, we give the results of numerical tests of three kinds of Newton-like asynchronous block iteration methods, which run well on a multiprocessor system. These results show that the parallel efficiency is very high.
Keywords: nonlinear equations asynchronous block iterative method (ABI method) shared memory processors
Compressed page walk cache
20
Authors: Dunbo ZHANG, Chaoyang JIA, Li SHEN. Source: Frontiers of Computer Science (SCIE, EI, CSCD), 2022, Issue 3, pp. 41-52 (12 pages)
GPUs are widely used in modern high-performance computing systems. To reduce the burden on GPU programmers, the operating system and GPU hardware provide strong support for shared virtual memory, which enables the GPU and CPU to share the same virtual address space. Unfortunately, the current SIMT execution model of GPUs brings great challenges for virtual-physical address translation on the GPU side, mainly due to the huge number of virtual addresses which are generated simultaneously and the poor locality of these virtual addresses. Thus, the excessive TLB accesses increase the miss ratio of the TLB. As an attractive solution, the Page Walk Cache (PWC) has received wide attention for its capability of reducing the memory accesses caused by TLB misses. However, the current PWC mechanism suffers from heavy redundancies, which significantly limit its efficiency. In this paper, we first investigate the factors leading to this issue by evaluating the performance of PWC with typical GPU benchmarks. We find that the repeated L4 and L3 indices of virtual addresses increase the redundancies in PWC, and the low locality of L2 indices causes the low hit ratio in PWC. Based on these observations, we propose a new PWC structure, namely the Compressed Page Walk Cache (CPWC), to resolve the redundancy burden in current PWC. Our CPWC can be organized in either direct-mapped mode or set-associative mode. Experimental results show that CPWC increases by 3 times over TPC in the number of page table entries, increases by 38.3% over PWC in L2 index hit ratio, and reduces the memory accesses of page tables by 26.9%. The average number of memory accesses caused by each TLB miss is reduced to 1.13. Overall, the average IPC improves by 25.3%.
Keywords: GPU shared virtual memory address translation PWC