This paper proposes an integrated method for concurrency control in parallel real-time databases. The nested transaction model is investigated because it offers more atomic execution units and finer-grained control within a transaction. Building on the classical nested locking protocol and the speculative concurrency control approach, the paper proposes a two-shadow adaptive concurrency control protocol. The protocol combines the Sacrifice-based Optimistic Concurrency Control (OPT-Sacrifice) and High Priority two-phase locking (HP2PL) algorithms so that each sub-transaction has both an optimistic and a pessimistic shadow, which increases the likelihood of successful timely commitment and avoids unnecessary replication overload.
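The pessimistic half of the protocol relies on High Priority two-phase locking. As a rough illustration of that conflict rule only (not the paper's two-shadow protocol), the following sketch assumes exclusive locks, a larger number meaning a more urgent transaction, and no wait queue; the names `Txn` and `LockManager` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Txn:
    name: str
    priority: int          # assumption: larger value = more urgent
    aborted: bool = False


@dataclass
class LockManager:
    # item -> transaction currently holding the (exclusive) lock
    holders: dict = field(default_factory=dict)

    def request(self, txn, item):
        """Apply the High Priority rule; return 'granted' or 'blocked'."""
        holder = self.holders.get(item)
        if holder is None or holder is txn:
            self.holders[item] = txn
            return "granted"
        if txn.priority > holder.priority:
            # The lower-priority holder is aborted and loses all its locks,
            # so the urgent requester never waits behind it.
            holder.aborted = True
            self.release_all(holder)
            self.holders[item] = txn
            return "granted"
        # The requester is not more urgent than the holder: it has to wait.
        return "blocked"

    def release_all(self, txn):
        """Drop every lock held by txn (on commit or abort)."""
        for item in [i for i, h in self.holders.items() if h is txn]:
            del self.holders[item]
```

In this sketch, a request from a more urgent transaction aborts the current holder, while a less urgent one is told to block. A real scheduler would also restart the aborted transaction and wake blocked ones on release.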
This paper offers a new method to solve the problem of software pipelining on nested loops. We first introduce our new software pipelining method, the Ruminate Method, which can optimize programs with nested loops. We also outline an algorithm to realize it and introduce the hardware support we designed. The performance of the Ruminate Method is analyzed at the end of this paper with the aid of our preliminary experimental results.
Parallel programs consist of a series of code sections with different thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or a high degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations into threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves the performance by up to 6.69 times, and by 2.01 times on average.
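The thread-activation idea can be modeled on the CPU. The sketch below is a sequential Python model, not CUDA and not the code the CUDA-NP compiler generates. It assumes one "master" thread per input row with `num_slaves` threads attached to it, a cyclic distribution of loop iterations (one of several possible choices), and a sum reduction; the function name and the workload are hypothetical.

```python
def kernel_model(rows, num_slaves):
    """Model a kernel where each master thread owns num_slaves threads.

    Sequential sections run on the master only; the parallel loop is
    spread over all of the master's threads and then reduced.
    """
    out = []
    for row in rows:  # one master thread per row
        # Sequential section: only the master (slave_id == 0) is active.
        offset = len(row)  # stand-in for some sequential work

        # Parallel loop: every thread of this master is active.
        # Iteration i is handled by thread i % num_slaves (cyclic split).
        partial = [0] * num_slaves
        for slave_id in range(num_slaves):
            for i in range(slave_id, len(row), num_slaves):
                partial[slave_id] += row[i] * row[i]

        # Reduction: partial sums are combined, then only the master
        # continues with the following sequential code.
        out.append(offset + sum(partial))
    return out
```

The result does not depend on `num_slaves`, which is the property a compiler transformation of this kind has to preserve; only the amount of work per thread changes.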
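This paper proposes a joint two-dimensional (2D) direction-of-arrival (DOA) estimation algorithm based on the cross-covariance of parallel nested arrays. The algorithm uses the cross-covariance of two mutually parallel nested arrays to generate a longer virtual array, and at the same time reduces the 2D DOA estimation problem to a one-dimensional (1D) DOA estimation problem. When constructing the covariance matrix, it exploits the Vandermonde structure of the direction matrix to increase the number of virtual snapshots, which keeps the aperture loss to a minimum. Finally, the algorithm applies the unitary Estimation of Signal Parameters via Rotational Invariance Technique (ESPRIT) together with the Total Least Squares (TLS) method to further reduce the influence of noise, and obtains automatically paired 2D DOA estimates. Compared with conventional DOA estimation algorithms for parallel arrays, the proposed algorithm achieves better DOA estimation performance, can identify more spatial sources, and is more robust to spatially colored noise. Simulation results verify the effectiveness of the algorithm.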
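To see why a nested geometry yields a longer virtual array, it helps to list the pairwise position differences of a single two-level nested array. The sketch below is only that simpler single-array case, not the paper's cross-covariance construction between two parallel arrays. Sensor positions are in units of the base spacing, and the function names are hypothetical.

```python
def nested_array(n1, n2):
    """Sensor positions of a two-level nested array (in unit spacings)."""
    inner = list(range(1, n1 + 1))                      # dense inner level
    outer = [(n1 + 1) * m for m in range(1, n2 + 1)]    # sparse outer level
    return inner + outer


def difference_coarray(positions):
    """All distinct pairwise position differences (the virtual lags)."""
    return sorted({p - q for p in positions for q in positions})


positions = nested_array(3, 3)          # 6 physical sensors
lags = difference_coarray(positions)    # virtual sensor positions
print(positions)
print(len(lags), lags[0], lags[-1])
```

With 3 + 3 physical sensors at positions 1, 2, 3, 4, 8, 12, the differences fill every integer lag from -11 to 11, so the virtual array has 23 contiguous elements.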
Funding (for the CUDA-NP paper): This work was supported by the U.S. National Science Foundation under Grant No. CCF-1216569 and by a National Science Foundation CAREER award under Grant No. CCF-0968667.