A method for automatic verification of cache coherence protocols is presented, in which cache coherence protocols are modeled as concurrent value-passing processes, and control and data consistency requirement are des...A method for automatic verification of cache coherence protocols is presented, in which cache coherence protocols are modeled as concurrent value-passing processes, and control and data consistency requirement are described as formulas in first-order p-calculus. A model checker is employed to check if the protocol under investigation satisfies the required properties. Using this method a data consistency error has been revealed in a well-known cache coherence protocol. The error has been corrected, and the revised protocol has been shown free from data consistency error for any data domain size, by appealing to data independence technique.展开更多
Directory protocols are widely adopted to maintain cache coherence of distributed shared memory multiprocessors. Although scalable to a certain extent, directory protocols are complex enough to prevent it from being u...Directory protocols are widely adopted to maintain cache coherence of distributed shared memory multiprocessors. Although scalable to a certain extent, directory protocols are complex enough to prevent it from being used in very large scale multiprocessors with tens of thousands of nodes. This paper proposes a lock-based cache coherence protocol for scope conyistency. It does not rely on directory information to maintain cache coherence. Instead, cache coherence is mailltained through requiring the releasing processor of a lock to store all write-notices generated in the associated critical section to the lock and the acquiring processor invalidates or updates its locally cached data copies according to the write notices of the lock. To evaluate the performance of the lock-based cache coherence protocol, a software DSM system named JIAJIA is built on network of workstations. Besides the lockbased cache coherence protocol, JIAJIA also characterizes itself with its shared memory organization scheme which combines the physical memories of multiple workstations to form a large shared space. Performance measurements with SPLASH2 program suite and NAS benchmarks indicate that, compared to recent SVM systems such as CVM, higher speedup is achieved by JIAJIA.Besides, JIAJIA can solve large scale problems that cannot be solved by other SVM systems due to memory size limitation.展开更多
This paper analyzes cache coherency mechanism from the view of system. It firstly discusses caehe-memory hierarchy of Pentium Ⅲ SMP system, including memory area distribution, cache attributes control and bus transac...This paper analyzes cache coherency mechanism from the view of system. It firstly discusses caehe-memory hierarchy of Pentium Ⅲ SMP system, including memory area distribution, cache attributes control and bus transaction. Secondly it analyzes hardware snoopy mechanism of P6 bus and MESI state transitions adopted by Pentium Ⅲ. Based on these, it focuses on how muhiprocessors and the P6 bus cooperate to ensure cache coherency of the whole system, and gives the key of cache coherency design.展开更多
In the field of supercomputing, one key issue for scal-able shared-memory multiprocessors is the design of the directory which denotes the sharing state for a cache block. A good direc-tory design intends to achieve t...In the field of supercomputing, one key issue for scal-able shared-memory multiprocessors is the design of the directory which denotes the sharing state for a cache block. A good direc-tory design intends to achieve three key attributes: reasonable memory overhead, sharer position precision and implementation complexity. However, researchers often face the problem that gain-ing one attribute may result in losing another. The paper proposes an elastic pointer directory (EPD) structure based on the analysis of shared-memory applications, taking the fact that the number of sharers for each directory entry is typical y smal . Analysis re-sults show that for 4 096 nodes, the ratio of memory overhead to the ful-map directory is 2.7%. Theoretical analysis and cycle-accurate execution-driven simulations on a 16 and 64-node cache coherence non uniform memory access (CC-NUMA) multiproces-sor show that the corresponding pointer overflow probability is reduced significantly. The performance is observed to be better than that of a limited pointers directory and almost identical to the ful-map directory, except for the slight implementation complex-ity. Using the directory cache to explore directory access locality is also studied. The experimental result shows that this is a promis-ing approach to be used in the state-of-the-art high performance computing domain.展开更多
The management of memory coherence is an important problem in distributed shared memory (DSM) system. In a cache-based coherence DSM system using linked list structure, the key to maintaining the coherence and improvi...The management of memory coherence is an important problem in distributed shared memory (DSM) system. In a cache-based coherence DSM system using linked list structure, the key to maintaining the coherence and improving system performance is how to manage the owner in the linked list. This paper presents the design of a new management protocol-NONH (New-OwnerNew-Head) and its performance evaluation. The analysis results show that thisprotocol can improve the scalability and performence of a coherent DSM system using linked list. It is also suitable for managing the cache coherency in tree-like hierarchical architecture.展开更多
As more processing cores are integrated into one chip and feature size continues to shrink, the average access la- tency for remote nodes using directory-based coherence protocol becomes higher, which greatly impacts ...As more processing cores are integrated into one chip and feature size continues to shrink, the average access la- tency for remote nodes using directory-based coherence protocol becomes higher, which greatly impacts system performance. Previous techniques such as data replication and data migration optimize the performance of the requesting core, but offer little improvement for neighbor nodes. Other techniques such as in-transit optimization try to reduce latency at the cost of increased storage. This paper introduces hierarchical cache directory into CMP (chip multiprocessor), which divides CMP tiles into multiple regions hierarchically, and combines it with data replication. A new directory organization is proposed to record the share status within a region and assist the regional home to complete operation efficiently. Simulation results show that for a 16-core CMP, compared to traditional directory, hierarchical cache directory reduces average access latency by 9% and on-chip network traffic by 34% on average with less storage. Theoretical analyses show that for a 2^n × 2^n tiled CMP, the average access latency in hierarchical cache directory asymptotically approaches a function that is independent of n, hence the architecture is highly scalable.展开更多
As the number of cores in chip multiprocessors (CMPs) increases, cache coherence protocol has become a key issue in integration of chip multiprocessors. Supporting cache coherence protocol in large chip multiprocess...As the number of cores in chip multiprocessors (CMPs) increases, cache coherence protocol has become a key issue in integration of chip multiprocessors. Supporting cache coherence protocol in large chip multiprocessors still faces three hurdles: design complexity, performance and scalability. This paper proposes Cache Coherent Network on Chip (CCNoC), a scheme that decouples cache coherency maintenance from processors and shared L2 caches and implements it completely in network on chip to free up processors and shared L2 caches from the chore of maintaining coherency, thereby reduces design complexity of CMPs. In this way, CCNoC also improves the performance of cache coherence protocol through reducing directory access latency and enhances scalability by avoiding massive directories overhead in shared L2 caches. In CCNoC, coherence state caches and active directory caches are implemented in the network interface components of network on chip to maintain cache coherence states for blocks in L1 caches and manage directory information for recently accessed blocks in L2 caches respectively. CCNoC provides a scalable CMP framework to tackle cache coherency which is the foundation of CMP. This paper evaluates the performance of CCNoC. Experimental results show that for a 16-core system, CCNoC improves performance by 3% on average over the conventional chip multiprocessor and by 10% at best, while reduces storage overhead by 1.8% and saves directory storage by 88%, showing good scalability.展开更多
Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This pa...Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from various aspects including system scalability, organization of memory hierarchy, network-on-chip, inter-chip connection and I/O subsystem.展开更多
基金Supported by the Intel Strategic CAD Labs, the National Natural Science Foundation of China (Grant No. 60421001), and the Chinese Academy of Sciences.
文摘A method for automatic verification of cache coherence protocols is presented, in which cache coherence protocols are modeled as concurrent value-passing processes, and control and data consistency requirement are described as formulas in first-order p-calculus. A model checker is employed to check if the protocol under investigation satisfies the required properties. Using this method a data consistency error has been revealed in a well-known cache coherence protocol. The error has been corrected, and the revised protocol has been shown free from data consistency error for any data domain size, by appealing to data independence technique.
文摘Directory protocols are widely adopted to maintain cache coherence of distributed shared memory multiprocessors. Although scalable to a certain extent, directory protocols are complex enough to prevent it from being used in very large scale multiprocessors with tens of thousands of nodes. This paper proposes a lock-based cache coherence protocol for scope conyistency. It does not rely on directory information to maintain cache coherence. Instead, cache coherence is mailltained through requiring the releasing processor of a lock to store all write-notices generated in the associated critical section to the lock and the acquiring processor invalidates or updates its locally cached data copies according to the write notices of the lock. To evaluate the performance of the lock-based cache coherence protocol, a software DSM system named JIAJIA is built on network of workstations. Besides the lockbased cache coherence protocol, JIAJIA also characterizes itself with its shared memory organization scheme which combines the physical memories of multiple workstations to form a large shared space. Performance measurements with SPLASH2 program suite and NAS benchmarks indicate that, compared to recent SVM systems such as CVM, higher speedup is achieved by JIAJIA.Besides, JIAJIA can solve large scale problems that cannot be solved by other SVM systems due to memory size limitation.
文摘This paper analyzes cache coherency mechanism from the view of system. It firstly discusses caehe-memory hierarchy of Pentium Ⅲ SMP system, including memory area distribution, cache attributes control and bus transaction. Secondly it analyzes hardware snoopy mechanism of P6 bus and MESI state transitions adopted by Pentium Ⅲ. Based on these, it focuses on how muhiprocessors and the P6 bus cooperate to ensure cache coherency of the whole system, and gives the key of cache coherency design.
基金supported by the National Natural Science Foundation of China(6123200961370059)+1 种基金the High Technology Research and Development Program of China(863 Program)(2011AA01A205)the Fund of the State Key Laboratory of Software Development Environment(SKLSDE2012ZX06)
文摘In the field of supercomputing, one key issue for scal-able shared-memory multiprocessors is the design of the directory which denotes the sharing state for a cache block. A good direc-tory design intends to achieve three key attributes: reasonable memory overhead, sharer position precision and implementation complexity. However, researchers often face the problem that gain-ing one attribute may result in losing another. The paper proposes an elastic pointer directory (EPD) structure based on the analysis of shared-memory applications, taking the fact that the number of sharers for each directory entry is typical y smal . Analysis re-sults show that for 4 096 nodes, the ratio of memory overhead to the ful-map directory is 2.7%. Theoretical analysis and cycle-accurate execution-driven simulations on a 16 and 64-node cache coherence non uniform memory access (CC-NUMA) multiproces-sor show that the corresponding pointer overflow probability is reduced significantly. The performance is observed to be better than that of a limited pointers directory and almost identical to the ful-map directory, except for the slight implementation complex-ity. Using the directory cache to explore directory access locality is also studied. The experimental result shows that this is a promis-ing approach to be used in the state-of-the-art high performance computing domain.
文摘The management of memory coherence is an important problem in distributed shared memory (DSM) system. In a cache-based coherence DSM system using linked list structure, the key to maintaining the coherence and improving system performance is how to manage the owner in the linked list. This paper presents the design of a new management protocol-NONH (New-OwnerNew-Head) and its performance evaluation. The analysis results show that thisprotocol can improve the scalability and performence of a coherent DSM system using linked list. It is also suitable for managing the cache coherency in tree-like hierarchical architecture.
基金supported by the National Natural Science Foundation of China under Grant Nos.60673145,60773146 and 60833004.
文摘As more processing cores are integrated into one chip and feature size continues to shrink, the average access la- tency for remote nodes using directory-based coherence protocol becomes higher, which greatly impacts system performance. Previous techniques such as data replication and data migration optimize the performance of the requesting core, but offer little improvement for neighbor nodes. Other techniques such as in-transit optimization try to reduce latency at the cost of increased storage. This paper introduces hierarchical cache directory into CMP (chip multiprocessor), which divides CMP tiles into multiple regions hierarchically, and combines it with data replication. A new directory organization is proposed to record the share status within a region and assist the regional home to complete operation efficiently. Simulation results show that for a 16-core CMP, compared to traditional directory, hierarchical cache directory reduces average access latency by 9% and on-chip network traffic by 34% on average with less storage. Theoretical analyses show that for a 2^n × 2^n tiled CMP, the average access latency in hierarchical cache directory asymptotically approaches a function that is independent of n, hence the architecture is highly scalable.
基金supported by the National Natural Science Foundation of China under Grant Nos.60970002,60833004,60773146, and 60673145.
文摘As the number of cores in chip multiprocessors (CMPs) increases, cache coherence protocol has become a key issue in integration of chip multiprocessors. Supporting cache coherence protocol in large chip multiprocessors still faces three hurdles: design complexity, performance and scalability. This paper proposes Cache Coherent Network on Chip (CCNoC), a scheme that decouples cache coherency maintenance from processors and shared L2 caches and implements it completely in network on chip to free up processors and shared L2 caches from the chore of maintaining coherency, thereby reduces design complexity of CMPs. In this way, CCNoC also improves the performance of cache coherence protocol through reducing directory access latency and enhances scalability by avoiding massive directories overhead in shared L2 caches. In CCNoC, coherence state caches and active directory caches are implemented in the network interface components of network on chip to maintain cache coherence states for blocks in L1 caches and manage directory information for recently accessed blocks in L2 caches respectively. CCNoC provides a scalable CMP framework to tackle cache coherency which is the foundation of CMP. This paper evaluates the performance of CCNoC. Experimental results show that for a 16-core system, CCNoC improves performance by 3% on average over the conventional chip multiprocessor and by 10% at best, while reduces storage overhead by 1.8% and saves directory storage by 88%, showing good scalability.
基金Supported by the National High Technology Development 863 Program of China under Grant No.2008AA010901the National Natural Science Foundation of China under Grant Nos.60736012 and 60673146the National Basic Research 973 Program of China under Grant No.2005CB321601.
文摘Godson-3 is the latest generation of Godson microprocessor family. It takes a scalable multi-core architecture with hardware support for accelerating applications including X86 emulation and signal processing. This paper introduces the system architecture of Godson-3 from various aspects including system scalability, organization of memory hierarchy, network-on-chip, inter-chip connection and I/O subsystem.