As a result of the interplay between advances in computer hardware, software, and algorithm, we are now in a new era of large-scale reservoir simulation, which focuses on accurate flow description, fine reservoir char...As a result of the interplay between advances in computer hardware, software, and algorithm, we are now in a new era of large-scale reservoir simulation, which focuses on accurate flow description, fine reservoir characterization, efficient nonlinear/linear solvers, and parallel implementation. In this paper, we discuss a multilevel preconditioner in a new-generation simulator and its implementation on multicore computers. This preconditioner relies on the method of subspace corrections to solve large-scale linear systems arising from fully implicit methods in reservoir simulations. We investigate the parallel efficiency and robustness of the proposed method by applying it to million-cell benchmark problems.展开更多
The design of parallel algorithms is studied in this paper. These algorithms are applicable to shared memory MIMD machines In this paper, the emphasis is put on the methods for design of the efficient parallel algori...The design of parallel algorithms is studied in this paper. These algorithms are applicable to shared memory MIMD machines In this paper, the emphasis is put on the methods for design of the efficient parallel algorithms. The design of efficient parallel algorithms should be based on the following considerationst algorithm parallelism and the hardware-parallelism; granularity of the parallel algorithm, algorithm optimization according to the underling parallel machine. In this paper , these principles are applied to solve a model problem of the PDE. The speedup of the new method is high. The results were tested and evaluated on a shared memory MIMD machine. The practical results were agree with the predicted performance.展开更多
Shared Memory (SM) switches are widely used for its high throughput, low delay and efficient use of memory. This paper compares the performance of two prominent switching schemes of SM packet switches: Cell-Based Swit...Shared Memory (SM) switches are widely used for its high throughput, low delay and efficient use of memory. This paper compares the performance of two prominent switching schemes of SM packet switches: Cell-Based Switching (CBS) and Packet-Based Switching (PBS).Theoretical analysis is carried out to draw qualitative conclusion on the memory requirement,throughput and packet delay of the two schemes. Furthermore, simulations are carried out to get quantitative results of the performance comparison under various system load, traffic patterns,and memory sizes. Simulation results show that PBS has the advantage of shorter time delay while CBS has lower memory requirement and outperforms in throughput when the memory size is limited. The comparison can be used for tradeoff between performance and complexity in switch design.展开更多
Nonlinear multisplitting method is known as parallel iterative methods for solving a large-scale system of nonlinear equations F(x) = 0. We extend the idea of nonlinear multisplitting and consider a new model ill whic...Nonlinear multisplitting method is known as parallel iterative methods for solving a large-scale system of nonlinear equations F(x) = 0. We extend the idea of nonlinear multisplitting and consider a new model ill which the iteration is executed asynchronously: Each processor calculate the solution of an individual nonlinear system belong to its nonlinear multisplitting and can update the global approximation residing in the shared memory at any time. A local convergence analysis of this model is presented. Finally, we give a uumerical example which shows a 'strange' property that speedup Sp > p and efficiency Ep > 1.展开更多
A multicast replication algorithm is proposed for shared memory switches. It uses a dedicated FIFO to multicast by replicating cells at receiver and the FIFO is operating with shared memory in parallel. Speedup is use...A multicast replication algorithm is proposed for shared memory switches. It uses a dedicated FIFO to multicast by replicating cells at receiver and the FIFO is operating with shared memory in parallel. Speedup is used to promote loss and delay performance. A new queueing analytical model is developed based on a sub-timeslot approach. The system performance in terms of cell loss and delay is analyzed and verified by simulation.展开更多
The simulation field became essential in designing or developing new casting products and in improving manufacturing processes within limited time, because it can help us to simulate the nature of processing, so that ...The simulation field became essential in designing or developing new casting products and in improving manufacturing processes within limited time, because it can help us to simulate the nature of processing, so that developers can make ideal casting designs. To take the prior occupation at commercial simulation market, so many development groups in the world are doing their every effort. They already reported successful stories in manufacturing fields by developing and providing the high performance simulation technologies for multipurpose. But they all run at powerful desk-side computers by well-trained experts mainly, so that it is hard to diffuse the scientific designing concept to newcomers in casting field. To overcome upcoming problems in scientific casting designs, we utilized information technologies and full-matured hardware backbones to spread out the effective and scientific casting design mind, and they all were integrated into Simulation Portal on the web. It professes scientific casting design on the NET including ubiquitous access way represented by "Anyone, Anytime, Anywhere" concept for casting designs.展开更多
GPUs become a ubiquitous choice as coprocessors since they have excellent ability in concurrent processing. In GPU architecture, shared memory plays a very important role in system performance as it can largely improv...GPUs become a ubiquitous choice as coprocessors since they have excellent ability in concurrent processing. In GPU architecture, shared memory plays a very important role in system performance as it can largely improve bandwidth utilization and accelerate memory operations. However, even for affine GPU applications that contain regular access patterns, optimizing for shared memory is not an easy work. It often requires programmer expertise and nontrivial parameter selection. Improper shared memory usage might even underutilize GPU resource: Even using state-of-the-art high level programming models (e.g., OpenACC and OpenHMPP), it is still hard to utilize shared memory since they lack inherent support in describing shared memory optimization and selecting suitable parameters, let alone maintaining high resource utilization. Targeting higher productivity for affine applications, we propose a data centric way to shared memory optimization on GPU. We design a pragma extension on OpenACC so as to convey data management hints of programmers to compiler. Meanwhile, we devise a compiler framework to automatically select optimal parameters for shared arrays, using the polyhedral model. We further propose optimization techniques to expose higher memory and instruction level parallelism. The experimental results show that our shared memory centric approaches effectively improve the performance of five typical GPU applications across four widely used platforms by 3.7x on average, and do not burden programmers with lots of pragmas.展开更多
This paper proposes a Shared Buffer Memory ATM Access Switch . The switches have significant benefits over Crossbar or Bus Based switches because its output buffer memories are shared by all the switch output ports an...This paper proposes a Shared Buffer Memory ATM Access Switch . The switches have significant benefits over Crossbar or Bus Based switches because its output buffer memories are shared by all the switch output ports and are allotted to one particular output port as the occasion demands. As Buffer allocation schemes in the ATM Switches is Partial Sharing, it is trade-off between Complete Sharing and Dedicated Allocation. In addition, the queuing structures used in the shared memory are independent of both the data path through the switch and the cell scheduling mechanism. The method for queue management is simple and effective.展开更多
To ensure the uniqueness and recognition of data and make it easy to analyze and process the data of all subsystems of the neutral beam injector (NBI), it is required that all subsystems have a unified system time. ...To ensure the uniqueness and recognition of data and make it easy to analyze and process the data of all subsystems of the neutral beam injector (NBI), it is required that all subsystems have a unified system time. In this paper, the timing synchronization software is presented which is related to many kinds of technologies, such as shared memory, multithreading, TCP protocol and so on. Shared memory helps the server save the information of clients and system time, multithreading can deal with different clients with different threads, the server works under Linux operating system, the client works under Linux operating system and Windows operating system. With the help of this design, synchronization of all subsystems can be achieved in less than one second, and this accuracy is enough for the NBI system and the reliability of data is thus ensured.展开更多
The Distributed Shared Memory(DSM)architecture is widely used in today’s computer design to mitigate the ever-widening processing-memory gap,and it inevitably exhibits Non-Uniform Memory Access(NUMA)to shared-memory ...The Distributed Shared Memory(DSM)architecture is widely used in today’s computer design to mitigate the ever-widening processing-memory gap,and it inevitably exhibits Non-Uniform Memory Access(NUMA)to shared-memory parallel applications.Failure to adapt to the NUMA effect can significantly downgrade application performance,especially on today’s manycore platforms with tens to hundreds of cores.However,traditional approaches such as first-touch and memory policy fall short in false page-sharing,fragmentation,or ease of use.In this paper,we propose a partitioned shared-memory approach that allows multithreaded applications to achieve full NUMA-awareness with only minor code changes and develop an accompanying NUMA-aware heap manager which eliminates false page-sharing and minimizes fragmentation.Experiments on a 256-core cc-NUMA computing node show that the proposed approach helps applications to adapt to NUMA with only minor code changes and improves the performance of typical multithreaded scientific applications by up to 4.3 folds with the increased use of cores.展开更多
Implementing video applications on emerging multi-core processors is a promising technique for personal, real-time multi-media applications. However, when porting the legacy parallel video encoders developed for clust...Implementing video applications on emerging multi-core processors is a promising technique for personal, real-time multi-media applications. However, when porting the legacy parallel video encoders developed for clusters to shared-memory multi-cores, the existing parallel algorithms result in workload imbalances on different cores and communication inefficiencies. This paper describes a strip-wise parallel scheme to balance workloads and a hybrid communication mechanism to reduce communication overhead. The implementation of the H.264 parallel encoder on an eight CPU Intel Xeon system achieves 5x to 6x speed-up over a single thread encoder and achieves a 29% performance improvement over the commonly used master-slave schemes on clusters. The paper also gives further analysis on scalability, parallel efficiency, workload balance, and communication overhead as the number of cores varies.展开更多
Thaditional implementation of sequential consistency in shared-memory systems requires memory accesses to be globally performed in program order. Based on an event ordering model for correct executions in shared-memor...Thaditional implementation of sequential consistency in shared-memory systems requires memory accesses to be globally performed in program order. Based on an event ordering model for correct executions in shared-memory systems, this paper proposes and proves that out-of-order execution does not influence the correctness of an execution providing certain condition is met. Simulation results show that out-of-order execution proposed in this paper is an effective way to improve the performance of a sequentially consistent shared-memory system.展开更多
A Single-Buffered (SB) router is a router where only one stage of shared buffering is sandwiched between two interconnects in comparison of a Combined Input and Output Queued (CIOQ) router where a central switch f...A Single-Buffered (SB) router is a router where only one stage of shared buffering is sandwiched between two interconnects in comparison of a Combined Input and Output Queued (CIOQ) router where a central switch fabric is sandwiched between two stages of buffering. The notion of SB routers was firstly proposed by the High-Performance Networking Group (HPNG) of Stanford University, along with two promising designs of SB routers: one of which was Parallel Shared Memory (PSM) router and the other was Distributed Shared Memory (DSM) router. Admittedly, the work of HPNG deserved full credit, but all results presented by them appeared to relay on a Centralized Memory Management Algorithm (CMMA) which was essentially impractical because of the high processing and communication complexity. This paper attempts to make a scalable high-speed SB router completely practical by introducing a fully distributed architecture for managing the shared memory of SB routers. The resulting SB router is called as a Virtual Output and Input Queued (VOIQ) router. Furthermore, the scheme of VOIQ routers can not only eliminate the need for the CMMA scheduler, thus allowing a fully distributed implementation with low processing and commu- nication complexity, but also provide QoS guarantees and efficiently support variable-length packets in this paper. In particular, the results of performance testing and the hardware implementation of our VOIQ-based router (NDSC~ SR1880-TTM series) are illustrated at the end of this paper. The proposal of this paper is the first distributed scheme of how to design and implement SB routers publicized till now.展开更多
This paper studies a high-speed text-independent Automatic Speaker Recognition(ASR)algorithm based on a multicore system's Gaussian Mixture Model(GMM).The high speech is achieved using parallel implementation of t...This paper studies a high-speed text-independent Automatic Speaker Recognition(ASR)algorithm based on a multicore system's Gaussian Mixture Model(GMM).The high speech is achieved using parallel implementation of the feature's extraction and aggregation methods during training and testing procedures.Shared memory parallel programming techniques using both OpenMP and PThreads libraries are developed to accelerate the code and improve the performance of the ASR algorithm.The experimental results show speed-up improvements of around 3.2 on a personal laptop with Intel i5-6300HQ(2.3 GHz,four cores without hyper-threading,and 8 GB of RAM).In addition,a remarkable 100%speaker recognition accuracy is achieved.展开更多
Efficient resource utilization requires that emerging datacenter interconnects support both high performance communication and efficient remote resource sharing. These goals require that the network be more tightly co...Efficient resource utilization requires that emerging datacenter interconnects support both high performance communication and efficient remote resource sharing. These goals require that the network be more tightly coupled with the CPU chips. Designing a new interconnection technology thus requires considering not only the interconnection itself, but also the design of the processors that will rely on it. In this paper, we study memory hierarchy implications for the design of high-speed datacenter interconnects particularly as they affect remote memory access -- and we use PCIe as the vehicle for our investigations. To that end, we build three complementary platforms: a PCIe-interconnected prototype server with which we measure and analyze current bottlenecks; a software simulator that lets us model microarchitectural and cache hierarchy changes; and an FPGA prototype system with a streamlined switchless customized protocol Thunder with which we study hardware optimizations outside the processor. We highlight several architectural modifications to better support remote memory access and communication, and quantify their impact and ]imitations.展开更多
Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks and a lot of time is spent on the competition arising at the lock hand-off. In order ...Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks and a lot of time is spent on the competition arising at the lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrive at the coherence controller. During lock hand-off only the requests from the winning processor contribute to progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller keeping the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. This mechanism requires neither compiler or programmer support nor ISA or coherence protocol changes. By simulating a 32-processor system, we show that using request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in those having a large amount of synchronization activity (the remaining four), we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 7170, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.展开更多
Computation offloading enables mobile devices to execute rich applications by using the abundant computing resources of powerful server systems. The distributed shared memory based (DSM-based) computation offloading a...Computation offloading enables mobile devices to execute rich applications by using the abundant computing resources of powerful server systems. The distributed shared memory based (DSM-based) computation offloading approach is expected to be especially popular in the near future because it can dynamically migrate running threads to computing nodes and does not require any modifications of existing applications to do so. The current DSM-based computation offloading scheme, however, has focused on efficiently offloading computationally intensive applications and has not considered the significant performance degradation caused by processing the I/O requests issued by offloaded threads. Because most mobile applications are interactive and thus yield frequent I/O requests, efficient handling of I/O operations is critically important. In this paper, we quantitatively analyze the performance degradation caused by I/O processing in DSM-based computation offloading schemes using representative commodity applications. To remedy the performance degradation, we apply a remote I/O scheme based on remote device support to computation offloading. The proposed approach improves the execution time by up to 43.6% and saves up to 17.7% of energy consumption in comparison with the existing offloading schemes. Selective compression of the remote I/O scheme reduces the network traffic by up to 53.5%.展开更多
The management of memory coherence is an important problem in distributed shared memory (DSM) system. In a cache-based coherence DSM system using linked list structure, the key to maintaining the coherence and improvi...The management of memory coherence is an important problem in distributed shared memory (DSM) system. In a cache-based coherence DSM system using linked list structure, the key to maintaining the coherence and improving system performance is how to manage the owner in the linked list. This paper presents the design of a new management protocol-NONH (New-OwnerNew-Head) and its performance evaluation. The analysis results show that thisprotocol can improve the scalability and performence of a coherent DSM system using linked list. It is also suitable for managing the cache coherency in tree-like hierarchical architecture.展开更多
GPUs are widely used in modem high-performance computing systems.To reduce the burden of GPU programmers,operating system and GPU hardware provide great supports for shared virtual memory,which enables GPU and CPU to ...GPUs are widely used in modem high-performance computing systems.To reduce the burden of GPU programmers,operating system and GPU hardware provide great supports for shared virtual memory,which enables GPU and CPU to share the same virtual address space.Unfortunately,the current SIMT execution model of GPU brings great challenges for the virtual-physical address translation on the GPU side,mainly due to the huge number of virtual addresses which are generated simultaneously and the bad locality of these virtual addresses.Thus,the excessive TLB accesses increase the miss ratio of TLB.As an attractive solution,Page Walk Cache(PWC)has received wide attention for its capability of reducing the memory accesses caused by TLB misses.However,the current PWC mechanism suffers from heavy redundancies,which significantly limits its efficiency.In this paper,we first investigate the facts leading to this issue by evaluating the performance of PWC with typical GPU benchmarks.We find that the repeated L4 and L3 indices of virtual addresses increase the redundancies in PWC,and the low locality of L2 indices causes the low hit ratio in PWC.Based on these observations,we propose a new PWC structure,namely Compressed Page Walk Cache(CPWC),to resolve the redundancy burden in current PWC.Our CPWC can be organized in either direct-mapped mode or set-associated mode.Experimental results show that CPWC increases by 3 times over TPC in the number of page table entries,increases by 38.3%over PWC in L2 index hit ratio and reduces by 26.9%in the memory accesses of page tables.The average memory accesses caused by each TLB miss is reduced to 1.13.Overall,the average IPC can improve by 25.3%.展开更多
This paper proposes a class of asynchronous block iterative methods for solving large scale nonlinear equations F(x)=0 and proves local convergence. This method splits F into p blocks, then does the asynch...This paper proposes a class of asynchronous block iterative methods for solving large scale nonlinear equations F(x)=0 and proves local convergence. This method splits F into p blocks, then does the asynchronous parallel iteration on the p multiprocessor with shared memory. Because each processor need only solve equations with a low dimension and there is no synchronous waiting time, the parallel efficiency can be increased. Finally, we give the results of the numerical test of three kinds of Newton like asynchronous block iteration methods which run well on a multiprocessor system. These results show that the parallel efficiency is very high.展开更多
基金support through PetroChina New-generation Reservoir Simulation Software (2011A-1010)the Program of Research on Continental Sedimentary Oil Reservoir Simulation (z121100004912001)+7 种基金founded by Beijing Municipal Science & Technology Commission and PetroChina Joint Research Funding12HT1050002654partially supported by the NSFC Grant 11201398Hunan Provincial Natural Science Foundation of China Grant 14JJ2063Specialized Research Fund for the Doctoral Program of Higher Education of China Grant 20124301110003partially supported by the Dean’s Startup Fund, Academy of Mathematics and System Sciences and the State High Tech Development Plan of China (863 Program 2012AA01A309partially supported by NSFC Grant 91130002Program for Changjiang Scholars and Innovative Research Team in University of China Grant IRT1179the Scientific Research Fund of the Hunan Provincial Education Department of China Grant 12A138
文摘As a result of the interplay between advances in computer hardware, software, and algorithm, we are now in a new era of large-scale reservoir simulation, which focuses on accurate flow description, fine reservoir characterization, efficient nonlinear/linear solvers, and parallel implementation. In this paper, we discuss a multilevel preconditioner in a new-generation simulator and its implementation on multicore computers. This preconditioner relies on the method of subspace corrections to solve large-scale linear systems arising from fully implicit methods in reservoir simulations. We investigate the parallel efficiency and robustness of the proposed method by applying it to million-cell benchmark problems.
文摘The design of parallel algorithms is studied in this paper. These algorithms are applicable to shared memory MIMD machines In this paper, the emphasis is put on the methods for design of the efficient parallel algorithms. The design of efficient parallel algorithms should be based on the following considerationst algorithm parallelism and the hardware-parallelism; granularity of the parallel algorithm, algorithm optimization according to the underling parallel machine. In this paper , these principles are applied to solve a model problem of the PDE. The speedup of the new method is high. The results were tested and evaluated on a shared memory MIMD machine. The practical results were agree with the predicted performance.
基金Supported by the National Natural Science Foundation of China(No.69896242).
文摘Shared Memory (SM) switches are widely used for its high throughput, low delay and efficient use of memory. This paper compares the performance of two prominent switching schemes of SM packet switches: Cell-Based Switching (CBS) and Packet-Based Switching (PBS).Theoretical analysis is carried out to draw qualitative conclusion on the memory requirement,throughput and packet delay of the two schemes. Furthermore, simulations are carried out to get quantitative results of the performance comparison under various system load, traffic patterns,and memory sizes. Simulation results show that PBS has the advantage of shorter time delay while CBS has lower memory requirement and outperforms in throughput when the memory size is limited. The comparison can be used for tradeoff between performance and complexity in switch design.
文摘Nonlinear multisplitting method is known as parallel iterative methods for solving a large-scale system of nonlinear equations F(x) = 0. We extend the idea of nonlinear multisplitting and consider a new model ill which the iteration is executed asynchronously: Each processor calculate the solution of an individual nonlinear system belong to its nonlinear multisplitting and can update the global approximation residing in the shared memory at any time. A local convergence analysis of this model is presented. Finally, we give a uumerical example which shows a 'strange' property that speedup Sp > p and efficiency Ep > 1.
文摘A multicast replication algorithm is proposed for shared memory switches. It uses a dedicated FIFO to multicast by replicating cells at receiver and the FIFO is operating with shared memory in parallel. Speedup is used to promote loss and delay performance. A new queueing analytical model is developed based on a sub-timeslot approach. The system performance in terms of cell loss and delay is analyzed and verified by simulation.
文摘The simulation field became essential in designing or developing new casting products and in improving manufacturing processes within limited time, because it can help us to simulate the nature of processing, so that developers can make ideal casting designs. To take the prior occupation at commercial simulation market, so many development groups in the world are doing their every effort. They already reported successful stories in manufacturing fields by developing and providing the high performance simulation technologies for multipurpose. But they all run at powerful desk-side computers by well-trained experts mainly, so that it is hard to diffuse the scientific designing concept to newcomers in casting field. To overcome upcoming problems in scientific casting designs, we utilized information technologies and full-matured hardware backbones to spread out the effective and scientific casting design mind, and they all were integrated into Simulation Portal on the web. It professes scientific casting design on the NET including ubiquitous access way represented by "Anyone, Anytime, Anywhere" concept for casting designs.
基金This work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010902, the National Natural Science Foundation of China (NSFC) under Grant No. 61432018, and the Innovation Research Group of NSFC under Grant No. 61221062.
文摘GPUs become a ubiquitous choice as coprocessors since they have excellent ability in concurrent processing. In GPU architecture, shared memory plays a very important role in system performance as it can largely improve bandwidth utilization and accelerate memory operations. However, even for affine GPU applications that contain regular access patterns, optimizing for shared memory is not an easy work. It often requires programmer expertise and nontrivial parameter selection. Improper shared memory usage might even underutilize GPU resource: Even using state-of-the-art high level programming models (e.g., OpenACC and OpenHMPP), it is still hard to utilize shared memory since they lack inherent support in describing shared memory optimization and selecting suitable parameters, let alone maintaining high resource utilization. Targeting higher productivity for affine applications, we propose a data centric way to shared memory optimization on GPU. We design a pragma extension on OpenACC so as to convey data management hints of programmers to compiler. Meanwhile, we devise a compiler framework to automatically select optimal parameters for shared arrays, using the polyhedral model. We further propose optimization techniques to expose higher memory and instruction level parallelism. The experimental results show that our shared memory centric approaches effectively improve the performance of five typical GPU applications across four widely used platforms by 3.7x on average, and do not burden programmers with lots of pragmas.
文摘This paper proposes a Shared Buffer Memory ATM Access Switch . The switches have significant benefits over Crossbar or Bus Based switches because its output buffer memories are shared by all the switch output ports and are allotted to one particular output port as the occasion demands. As Buffer allocation schemes in the ATM Switches is Partial Sharing, it is trade-off between Complete Sharing and Dedicated Allocation. In addition, the queuing structures used in the shared memory are independent of both the data path through the switch and the cell scheduling mechanism. The method for queue management is simple and effective.
基金supported by National Natural Science Foundation of China(No.11075183)the Knowledge Innovation Program of the Chinese Academy of Sciences(the study of neutral beam steady-state operation of the key technical and physical problems)
文摘To ensure the uniqueness and recognition of data and make it easy to analyze and process the data of all subsystems of the neutral beam injector (NBI), it is required that all subsystems have a unified system time. In this paper, the timing synchronization software is presented which is related to many kinds of technologies, such as shared memory, multithreading, TCP protocol and so on. Shared memory helps the server save the information of clients and system time, multithreading can deal with different clients with different threads, the server works under Linux operating system, the client works under Linux operating system and Windows operating system. With the help of this design, synchronization of all subsystems can be achieved in less than one second, and this accuracy is enough for the NBI system and the reliability of data is thus ensured.
基金supported by the National Key Research and Development Program of China(No.2016YFB0201300)。
文摘The Distributed Shared Memory(DSM)architecture is widely used in today’s computer design to mitigate the ever-widening processing-memory gap,and it inevitably exhibits Non-Uniform Memory Access(NUMA)to shared-memory parallel applications.Failure to adapt to the NUMA effect can significantly downgrade application performance,especially on today’s manycore platforms with tens to hundreds of cores.However,traditional approaches such as first-touch and memory policy fall short in false page-sharing,fragmentation,or ease of use.In this paper,we propose a partitioned shared-memory approach that allows multithreaded applications to achieve full NUMA-awareness with only minor code changes and develop an accompanying NUMA-aware heap manager which eliminates false page-sharing and minimizes fragmentation.Experiments on a 256-core cc-NUMA computing node show that the proposed approach helps applications to adapt to NUMA with only minor code changes and improves the performance of typical multithreaded scientific applications by up to 4.3 folds with the increased use of cores.
基金Supported by the National Natural Science Foundation of China(No. 60236020)
文摘Implementing video applications on emerging multi-core processors is a promising technique for personal, real-time multi-media applications. However, when porting the legacy parallel video encoders developed for clusters to shared-memory multi-cores, the existing parallel algorithms result in workload imbalances on different cores and communication inefficiencies. This paper describes a strip-wise parallel scheme to balance workloads and a hybrid communication mechanism to reduce communication overhead. The implementation of the H.264 parallel encoder on an eight CPU Intel Xeon system achieves 5x to 6x speed-up over a single thread encoder and achieves a 29% performance improvement over the commonly used master-slave schemes on clusters. The paper also gives further analysis on scalability, parallel efficiency, workload balance, and communication overhead as the number of cores varies.
文摘Thaditional implementation of sequential consistency in shared-memory systems requires memory accesses to be globally performed in program order. Based on an event ordering model for correct executions in shared-memory systems, this paper proposes and proves that out-of-order execution does not influence the correctness of an execution providing certain condition is met. Simulation results show that out-of-order execution proposed in this paper is an effective way to improve the performance of a sequentially consistent shared-memory system.
基金the National High-Tech Research and De-velopment Program of China (863 Program) (2003AA103510, 2004AA103130, 2005AA121210).
文摘A Single-Buffered (SB) router is a router where only one stage of shared buffering is sandwiched between two interconnects in comparison of a Combined Input and Output Queued (CIOQ) router where a central switch fabric is sandwiched between two stages of buffering. The notion of SB routers was firstly proposed by the High-Performance Networking Group (HPNG) of Stanford University, along with two promising designs of SB routers: one of which was Parallel Shared Memory (PSM) router and the other was Distributed Shared Memory (DSM) router. Admittedly, the work of HPNG deserved full credit, but all results presented by them appeared to relay on a Centralized Memory Management Algorithm (CMMA) which was essentially impractical because of the high processing and communication complexity. This paper attempts to make a scalable high-speed SB router completely practical by introducing a fully distributed architecture for managing the shared memory of SB routers. The resulting SB router is called as a Virtual Output and Input Queued (VOIQ) router. Furthermore, the scheme of VOIQ routers can not only eliminate the need for the CMMA scheduler, thus allowing a fully distributed implementation with low processing and commu- nication complexity, but also provide QoS guarantees and efficiently support variable-length packets in this paper. In particular, the results of performance testing and the hardware implementation of our VOIQ-based router (NDSC~ SR1880-TTM series) are illustrated at the end of this paper. The proposal of this paper is the first distributed scheme of how to design and implement SB routers publicized till now.
文摘This paper studies a high-speed text-independent Automatic Speaker Recognition(ASR)algorithm based on a multicore system's Gaussian Mixture Model(GMM).The high speech is achieved using parallel implementation of the feature's extraction and aggregation methods during training and testing procedures.Shared memory parallel programming techniques using both OpenMP and PThreads libraries are developed to accelerate the code and improve the performance of the ASR algorithm.The experimental results show speed-up improvements of around 3.2 on a personal laptop with Intel i5-6300HQ(2.3 GHz,four cores without hyper-threading,and 8 GB of RAM).In addition,a remarkable 100%speaker recognition accuracy is achieved.
基金This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA06010401, and the National Natural Science Foundation of China under Grant Nos. 61100010, 61402438, and 61402439.
文摘Efficient resource utilization requires that emerging datacenter interconnects support both high performance communication and efficient remote resource sharing. These goals require that the network be more tightly coupled with the CPU chips. Designing a new interconnection technology thus requires considering not only the interconnection itself, but also the design of the processors that will rely on it. In this paper, we study memory hierarchy implications for the design of high-speed datacenter interconnects particularly as they affect remote memory access -- and we use PCIe as the vehicle for our investigations. To that end, we build three complementary platforms: a PCIe-interconnected prototype server with which we measure and analyze current bottlenecks; a software simulator that lets us model microarchitectural and cache hierarchy changes; and an FPGA prototype system with a streamlined switchless customized protocol Thunder with which we study hardware optimizations outside the processor. We highlight several architectural modifications to better support remote memory access and communication, and quantify their impact and ]imitations.
基金supported in part by Spanish Government and European ERDF under Grant Nos. TIN2007-66423, TIN2010-21291-C02-01 and TIN2007-60625gaZ:T48 research group (Arag'on Government and European ESF)+1 种基金Consolider CSD2007-00050 (Spanish Government)HiPEAC-2 NoE (European FP7/ICT 217068)
文摘Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks and a lot of time is spent on the competition arising at the lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrive at the coherence controller. During lock hand-off only the requests from the winning processor contribute to progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller keeping the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. This mechanism requires neither compiler or programmer support nor ISA or coherence protocol changes. By simulating a 32-processor system, we show that using request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in those having a large amount of synchronization activity (the remaining four), we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 7170, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.
文摘Computation offloading enables mobile devices to execute rich applications by using the abundant computing resources of powerful server systems. The distributed shared memory based (DSM-based) computation offloading approach is expected to be especially popular in the near future because it can dynamically migrate running threads to computing nodes and does not require any modifications of existing applications to do so. The current DSM-based computation offloading scheme, however, has focused on efficiently offloading computationally intensive applications and has not considered the significant performance degradation caused by processing the I/O requests issued by offloaded threads. Because most mobile applications are interactive and thus yield frequent I/O requests, efficient handling of I/O operations is critically important. In this paper, we quantitatively analyze the performance degradation caused by I/O processing in DSM-based computation offloading schemes using representative commodity applications. To remedy the performance degradation, we apply a remote I/O scheme based on remote device support to computation offloading. The proposed approach improves the execution time by up to 43.6% and saves up to 17.7% of energy consumption in comparison with the existing offloading schemes. Selective compression of the remote I/O scheme reduces the network traffic by up to 53.5%.
文摘The management of memory coherence is an important problem in distributed shared memory (DSM) system. In a cache-based coherence DSM system using linked list structure, the key to maintaining the coherence and improving system performance is how to manage the owner in the linked list. This paper presents the design of a new management protocol-NONH (New-OwnerNew-Head) and its performance evaluation. The analysis results show that thisprotocol can improve the scalability and performence of a coherent DSM system using linked list. It is also suitable for managing the cache coherency in tree-like hierarchical architecture.
基金This paper was supported by the National Natural Science Fundation of China(Grant No.61972407).
文摘GPUs are widely used in modem high-performance computing systems.To reduce the burden of GPU programmers,operating system and GPU hardware provide great supports for shared virtual memory,which enables GPU and CPU to share the same virtual address space.Unfortunately,the current SIMT execution model of GPU brings great challenges for the virtual-physical address translation on the GPU side,mainly due to the huge number of virtual addresses which are generated simultaneously and the bad locality of these virtual addresses.Thus,the excessive TLB accesses increase the miss ratio of TLB.As an attractive solution,Page Walk Cache(PWC)has received wide attention for its capability of reducing the memory accesses caused by TLB misses.However,the current PWC mechanism suffers from heavy redundancies,which significantly limits its efficiency.In this paper,we first investigate the facts leading to this issue by evaluating the performance of PWC with typical GPU benchmarks.We find that the repeated L4 and L3 indices of virtual addresses increase the redundancies in PWC,and the low locality of L2 indices causes the low hit ratio in PWC.Based on these observations,we propose a new PWC structure,namely Compressed Page Walk Cache(CPWC),to resolve the redundancy burden in current PWC.Our CPWC can be organized in either direct-mapped mode or set-associated mode.Experimental results show that CPWC increases by 3 times over TPC in the number of page table entries,increases by 38.3%over PWC in L2 index hit ratio and reduces by 26.9%in the memory accesses of page tables.The average memory accesses caused by each TLB miss is reduced to 1.13.Overall,the average IPC can improve by 25.3%.
基金Supported by the National Natural Scie-nce Foundation of China
文摘This paper proposes a class of asynchronous block iterative methods for solving large scale nonlinear equations F(x)=0 and proves local convergence. This method splits F into p blocks, then does the asynchronous parallel iteration on the p multiprocessor with shared memory. Because each processor need only solve equations with a low dimension and there is no synchronous waiting time, the parallel efficiency can be increased. Finally, we give the results of the numerical test of three kinds of Newton like asynchronous block iteration methods which run well on a multiprocessor system. These results show that the parallel efficiency is very high.