Funding: the National High-Tech Research and Development Program of China (863 Program) (2003AA103510, 2004AA103130, 2005AA121210).
Abstract: A Single-Buffered (SB) router has only one stage of shared buffering, sandwiched between two interconnects, in contrast to a Combined Input and Output Queued (CIOQ) router, where a central switch fabric is sandwiched between two stages of buffering. The notion of SB routers was first proposed by the High-Performance Networking Group (HPNG) of Stanford University, along with two promising SB router designs: the Parallel Shared Memory (PSM) router and the Distributed Shared Memory (DSM) router. The work of HPNG deserves full credit, but all of the results they presented appear to rely on a Centralized Memory Management Algorithm (CMMA), which is essentially impractical because of its high processing and communication complexity. This paper attempts to make a scalable high-speed SB router completely practical by introducing a fully distributed architecture for managing the shared memory of SB routers. The resulting SB router is called a Virtual Output and Input Queued (VOIQ) router. The VOIQ scheme not only eliminates the need for the CMMA scheduler, thus allowing a fully distributed implementation with low processing and communication complexity, but also provides QoS guarantees and efficiently supports variable-length packets. The results of performance testing and the hardware implementation of our VOIQ-based router (NDSC~ SR1880-TTM series) are presented at the end of this paper. This proposal is the first distributed scheme for designing and implementing SB routers published to date.
Funding: supported in part by the Spanish Government and European ERDF under Grant Nos. TIN2007-66423, TIN2010-21291-C02-01 and TIN2007-60625; gaZ: T48 research group (Aragón Government and European ESF); Consolider CSD2007-00050 (Spanish Government); HiPEAC-2 NoE (European FP7/ICT 217068).
Abstract: Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks, and a lot of time is spent on the contention that arises at the lock hand-off. To be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrives at the coherence controller. During the hand-off, only the requests from the winning processor contribute to the progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller that keeps the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent in all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and, as a consequence, speeds up the whole parallel computation. The mechanism requires neither compiler nor programmer support, nor any ISA or coherence protocol changes. By simulating a 32-processor system, we show that request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in the remaining four, which have a large amount of synchronization activity, we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 71%, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, and observe a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.
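The bypassing idea in this abstract can be illustrated with a toy software model. The sketch below is not the authors' hardware design: the `Request` type, the `service_order` function, and the way the winning processor is identified (passed in as a parameter) are all assumptions made for illustration. It only shows how letting the winner's requests jump ahead of buffered requests changes the order in which a controller serves one lock line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """A buffered request to the lock line (hypothetical model)."""
    proc: int   # id of the requesting processor
    kind: str   # e.g. "acquire", "data", "release"


def service_order(requests, winner, bypass):
    """Return the order in which buffered requests to one line are served.

    requests: list of Request in arrival order.
    winner:   id of the processor that currently holds the lock
              (assumed known here; the abstract does not say how
              the hardware determines it).
    bypass:   if False, serve strictly in arrival (FIFO) order;
              if True, serve the winner's oldest pending request first.
    """
    pending = list(requests)
    served = []
    while pending:
        idx = 0  # default: oldest buffered request
        if bypass:
            # Find the oldest pending request from the winning processor.
            for i, req in enumerate(pending):
                if req.proc == winner:
                    idx = i
                    break
        served.append(pending.pop(idx))
    return served


# Processors 1-3 are spinning on the lock; processor 0 holds it and
# issues its release after their requests have already been buffered.
buffered = [
    Request(1, "acquire"),
    Request(2, "acquire"),
    Request(3, "acquire"),
    Request(0, "release"),
]

fifo = service_order(buffered, winner=0, bypass=False)
bypassed = service_order(buffered, winner=0, bypass=True)
print("FIFO position of release:", fifo.index(Request(0, "release")))
print("Bypass position of release:", bypassed.index(Request(0, "release")))
```

In this model the release waits behind three losing requests under FIFO service, while with bypassing it is served first; the losing requests keep their relative arrival order in both cases.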
Abstract: Computation offloading enables mobile devices to execute rich applications by using the abundant computing resources of powerful server systems. The distributed shared memory based (DSM-based) computation offloading approach is expected to be especially popular in the near future because it can dynamically migrate running threads to computing nodes without requiring any modification of existing applications. Current DSM-based computation offloading schemes, however, have focused on efficiently offloading computationally intensive applications and have not considered the significant performance degradation caused by processing the I/O requests issued by offloaded threads. Because most mobile applications are interactive and thus issue frequent I/O requests, efficient handling of I/O operations is critically important. In this paper, we quantitatively analyze the performance degradation caused by I/O processing in DSM-based computation offloading schemes using representative commodity applications. To remedy this degradation, we apply a remote I/O scheme based on remote device support to computation offloading. The proposed approach improves execution time by up to 43.6% and reduces energy consumption by up to 17.7% in comparison with existing offloading schemes. Selective compression in the remote I/O scheme reduces network traffic by up to 53.5%.
Abstract: The management of memory coherence is an important problem in distributed shared memory (DSM) systems. In a cache-based coherent DSM system that uses a linked-list structure, the key to maintaining coherence and improving system performance is how the owner in the linked list is managed. This paper presents the design of a new management protocol, NONH (New-Owner New-Head), and its performance evaluation. The analysis results show that this protocol can improve the scalability and performance of a coherent DSM system that uses linked lists. It is also suitable for managing cache coherence in tree-like hierarchical architectures.