Funding: Supported in part by the Shanxi Scholarship Council of China (Grant No. 2017-key-2), the Natural Science Foundation of Shanxi Province (Grant No. 201801D121145), the Natural Science Foundation of China (NSFC) (Grant Nos. 61731014, 61705157, 61927811), and the Program for Guangdong Introducing Innovative and Entrepreneurial Teams.
Abstract: The NIST (National Institute of Standards and Technology) statistical test suite, recognized as the most authoritative, is widely used to verify the randomness of binary sequences. The Non-overlapping Template Matching Test, the 7th test of the NIST Test Suite, is remarkably time-consuming, and its slow performance is one of the major hurdles in the testing process. In this paper, we present an efficient bit-parallel matching algorithm and a segmented scan-based strategy for execution on a Graphics Processing Unit (GPU) using the NVIDIA Compute Unified Device Architecture (CUDA). Experimental results show a significant performance improvement for the parallelized Non-overlapping Template Matching Test: it runs 483 times faster than the original NIST implementation without reducing the accuracy of the test results.
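To make the test concrete, here is a minimal sequential sketch in Python of the statistic it computes, with a simple bit-parallel step that finds every match offset in a block using shifts and masks on one packed integer. This is not the paper's GPU algorithm or its segmented scan; the function names are hypothetical, and the mean and variance formulas are the ones commonly quoted from NIST SP 800-22 for this test.

```python
def match_mask(block_bits, template_bits):
    """Return an int whose bit i is set when the template matches at offset i."""
    M, m = len(block_bits), len(template_bits)
    # Pack the block so that bit i of `word` equals block_bits[i].
    word = 0
    for i, b in enumerate(block_bits):
        word |= b << i
    full = (1 << M) - 1
    mask = (1 << (M - m + 1)) - 1          # valid start offsets are 0 .. M-m
    for k, t in enumerate(template_bits):
        shifted = word >> k                # bit i now holds block_bits[i + k]
        agree = shifted if t else (~shifted & full)
        mask &= agree                      # keep offsets that agree at position k
    return mask


def count_nonoverlapping(block_bits, template_bits):
    """Greedy left-to-right count: after a match the window jumps by m bits."""
    m = len(template_bits)
    mask = match_mask(block_bits, template_bits)
    count, i, last = 0, 0, len(block_bits) - m
    while i <= last:
        if (mask >> i) & 1:
            count += 1
            i += m
        else:
            i += 1
    return count


def chi_square(bits, template_bits, n_blocks):
    """Per-block match counts and the chi-square statistic of the test."""
    m = len(template_bits)
    M = len(bits) // n_blocks              # block length
    mu = (M - m + 1) / 2 ** m              # expected matches per block
    var = M * (1 / 2 ** m - (2 * m - 1) / 2 ** (2 * m))
    counts = [
        count_nonoverlapping(bits[j * M:(j + 1) * M], template_bits)
        for j in range(n_blocks)
    ]
    chi2 = sum((w - mu) ** 2 for w in counts) / var
    return counts, chi2


if __name__ == "__main__":
    bits = [int(c) for c in "10100100101110010110"]
    print(chi_square(bits, [0, 0, 1], n_blocks=2))
```

For the 20-bit sequence above, split into two blocks with template 001, the counts are 2 and 1 and the statistic is about 2.1333. Converting the statistic to a P-value needs the incomplete gamma function, which is left out here to keep the sketch free of third-party dependencies.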
Abstract: In this paper, some parallel algorithms are described for solving numerical linear algebra problems on the Dawning-1000. They include matrix multiplication, LU factorization of a dense matrix, Cholesky factorization of a symmetric matrix, and eigendecomposition of a symmetric matrix for real and complex data types. These programs are built on the fast BLAS library of the Dawning-1000 under the NX environment. Some comparison results under different parallel environments and implementation methods are also given for Cholesky factorization. The execution time, measured performance, and speedup for each problem on the Dawning-1000 are shown. For matrix multiplication and LU factorization, 1.86 GFLOPS and 1.53 GFLOPS are reached, respectively.
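As a reference point for one of the kernels listed, the following is a minimal sequential sketch of Cholesky factorization in plain Python. It assumes a real symmetric positive-definite input (the abstract says only "symmetric") and does not reflect the paper's parallel, BLAS-based implementation; the function name is hypothetical.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a (a symmetric positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Dot product of the already computed parts of rows i and j.
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


if __name__ == "__main__":
    A = [[4, 12, -16], [12, 37, -43], [-16, -43, 98]]
    for row in cholesky(A):
        print(row)
```

Row i depends only on earlier rows, which is why blocked variants of this loop nest map onto matrix-matrix BLAS calls in distributed implementations.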
Funding: Supported by the Research Council of Norway through a Centre of Excellence Grant (Grant No. 179568/V30) and from the Norwegian Supercomputing Program (NOTUR) through a grant of computer time (Grant No. NN4654K).
Abstract: We present a parallel and linear scaling implementation of the calculation of the electrostatic potential arising from an arbitrary charge distribution. Our approach makes use of the multi-resolution basis of multiwavelets. The potential is obtained as the direct solution of the Poisson equation in its Green's function integral form. In the multiwavelet basis, the formally nonlocal integral operator decays rapidly to negligible values away from the main diagonal, yielding an effectively banded structure where the bandwidth is dictated only by the requested accuracy. This sparse operator structure has been exploited to achieve linear scaling and parallel algorithms. Parallelization has been achieved through both the shared memory (OpenMP) and the message passing interface (MPI) paradigms. Our implementation has been tested by computing the electrostatic potential of the electronic density of long-chain alkanes and diamond fragments, showing (sub)linear scaling with the system size and efficient parallelization.
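For orientation, the Green's function integral form of the Poisson equation can be written as below. The symbols $V$ (potential) and $\rho$ (charge density), the Gaussian/atomic unit convention, and the free-space boundary condition ($V \to 0$ at infinity) are assumptions made here for illustration; the abstract does not state them.

```latex
\begin{aligned}
\nabla^2 V(\mathbf{r}) &= -4\pi\,\rho(\mathbf{r}), \\
V(\mathbf{r}) &= \int G(\mathbf{r},\mathbf{r}')\,\rho(\mathbf{r}')\,\mathrm{d}\mathbf{r}',
\qquad
G(\mathbf{r},\mathbf{r}') = \frac{1}{\lvert \mathbf{r}-\mathbf{r}' \rvert}.
\end{aligned}
```

The kernel $G$ couples every pair of points, which is why the operator is formally nonlocal; the banded structure described in the abstract is a property of its representation in the multiwavelet basis.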
Abstract: Clustering is one of the most widely used techniques for exploratory data analysis. The spectral clustering algorithm, a popular modern clustering algorithm, has been shown to be more effective in detecting clusters than many traditional algorithms. It has applications ranging from computer vision and information retrieval to social science and biology. With the size of databases soaring, clustering algorithms face scaling problems in computational time and memory use. In this paper, we propose a parallel spectral clustering implementation based on MapReduce. Both the computation and the data storage are distributed, which solves the scalability problems of most existing algorithms. We empirically analyze the proposed implementation on both benchmark networks and a real social network dataset of about two million vertices and two billion edges crawled from Sina Weibo. It is shown that the proposed implementation scales well, speeds up the clustering without sacrificing quality, and processes massive datasets efficiently on commodity machine clusters.
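The core idea of spectral clustering can be illustrated on a single machine with a two-way split. The sketch below is not the paper's MapReduce implementation: it assumes an undirected graph with no isolated vertices, uses plain power iteration with deflation to approximate the second eigenvector of the normalized adjacency matrix $D^{-1/2} A D^{-1/2}$, and splits vertices by the sign of that vector. All names are hypothetical.

```python
import math


def spectral_bipartition(adj, iters=200):
    """adj: dict mapping each node to the set of its neighbours (undirected)."""
    nodes = sorted(adj)
    idx = {u: i for i, u in enumerate(nodes)}
    deg = [len(adj[u]) for u in nodes]

    # The leading eigenvector of N = D^-1/2 A D^-1/2 is proportional to sqrt(deg).
    top = [math.sqrt(d) for d in deg]
    norm = math.sqrt(sum(t * t for t in top))
    top = [t / norm for t in top]

    def step(x):
        # y = (x + N x) / 2; the shift makes every eigenvalue non-negative,
        # so power iteration converges to the largest remaining one.
        y = [0.5 * xi for xi in x]
        for u in nodes:
            i = idx[u]
            for v in adj[u]:
                j = idx[v]
                y[i] += 0.5 * x[j] / math.sqrt(deg[i] * deg[j])
        return y

    x = [float(i + 1) for i in range(len(nodes))]   # deterministic start vector
    for _ in range(iters):
        # Deflation: remove the component along the leading eigenvector.
        dot = sum(a * b for a, b in zip(x, top))
        x = [a - dot * b for a, b in zip(x, top)]
        x = step(x)
        norm = math.sqrt(sum(a * a for a in x))
        x = [a / norm for a in x]

    side = {u for u in nodes if x[idx[u]] >= 0}
    return side, set(nodes) - side


if __name__ == "__main__":
    # Two triangles joined by the single edge 2-3.
    graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
    print(spectral_bipartition(graph))
```

Each iteration is a sparse matrix-vector product plus a few vector reductions, which are the kinds of operations a distributed implementation would spread across machines.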
Funding: The authors extend their appreciation to the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University for funding and supporting this work through the Graduate Student Research Support Program.
Abstract: The last decade witnessed a rapid increase in multimedia and other applications that require transmitting and protecting huge amounts of data streams simultaneously. For such applications, a high-performance cryptosystem is compulsory to provide the necessary security services. The elliptic curve cryptosystem (ECC) has been introduced as an option worth considering. However, the usual sequential implementation of ECC and the standard elliptic curve (EC) form cannot achieve the required performance level. Moreover, the widely used hardware implementation of ECC is a costly option and may not be affordable. This research aims to develop a high-performance parallel software implementation for ECC. To achieve this, many experiments were performed to examine several factors affecting ECC performance, including the projective coordinates, the scalar multiplication algorithm, the elliptic curve (EC) form, and the parallel implementation. The ECC performance was analyzed across these factors to tune them and select the best choices for increasing the speed of the cryptosystem. Experimental results showed that the parallel Montgomery ECC implementation using homogeneous projection achieves the highest performance level, since it had the shortest time delay for ECC computations. In addition, results showed that the NAF algorithm consumes less time to perform encryption and scalar multiplication operations than the Montgomery ladder and binary methods. The Java multi-threading technique was adopted to implement ECC computations in parallel. The proposed multithreaded Montgomery ECC implementation significantly improves the performance level compared to previously presented parallel and sequential implementations.
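To illustrate why the NAF (non-adjacent form) method can save work, here is a minimal Python sketch of NAF recoding and NAF-based scalar multiplication. It makes several assumptions that are not from the paper: a toy short-Weierstrass curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ with base point (5, 1), affine coordinates instead of projective ones, and no parallelism. The function names are hypothetical, and the parameters are far too small for real use.

```python
P_MOD, A = 17, 2   # toy curve y^2 = x^3 + 2x + 2 over F_17 (illustrative only)
G = (5, 1)         # a point on that curve; None represents the point at infinity


def naf(k):
    """Signed digits of k, least significant first; no two adjacent digits are non-zero."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)    # +1 or -1, chosen so that the next digit is 0
            k -= d
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits


def add(p, q):
    """Affine point addition, including doubling and the identity."""
    if p is None:
        return q
    if q is None:
        return p
    x1, y1 = p
    x2, y2 = q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None            # p + (-p), or doubling a point with y == 0
    if p == q:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def neg(p):
    return None if p is None else (p[0], (-p[1]) % P_MOD)


def scalar_mult_naf(k, p):
    """Left-to-right double-and-add/subtract driven by the NAF digits of k."""
    result = None
    for d in reversed(naf(k)):
        result = add(result, result)       # doubling at every digit
        if d == 1:
            result = add(result, p)
        elif d == -1:
            result = add(result, neg(p))   # point negation is essentially free
    return result


if __name__ == "__main__":
    print(naf(7), scalar_mult_naf(7, G))
```

For example, 7 is 111 in binary (three non-zero digits) but 8 - 1 in NAF (two non-zero digits), so the NAF loop performs one fewer point addition. Fewer non-zero digits is the usual explanation for NAF outperforming the plain binary method.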
Funding: National Science and Technology Major Project of China (No. 2017-II 0006-0020), National Key Research and Development Project of China (2016YFB0200901), and National Natural Science Foundation of China (51776154).
Abstract: Accurate and efficient prediction of the aerodynamic performance and flow details of axial-flow compressors is of great engineering value for their aerodynamic design and flow control. In this work, a delayed detached eddy simulation method is developed and applied to numerically simulate turbulent channel flow and the aerodynamic performance of NASA Rotor 35. Several acceleration techniques, including parallel implementation, are also used to speed up iteration convergence. The mean velocity distribution and Reynolds stress distribution in the boundary layer of the turbulent channel flow and the aerodynamic performance curve of NASA Rotor 35 are predicted. The good agreement between the present delayed detached eddy simulation results and the available direct numerical simulation results or experimental data confirms the effectiveness of the developed method for the accurate and efficient prediction of complex flow in turbomachinery.
Funding: Supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research and Biological and Environmental Research programs through the Scientific Discovery through Advanced Computing (SciDAC) project PISCEES; by the US National Science Foundation under grant number DMS-1215659; by the National 863 Project of China under grant number 2012AA01A309; and by the National Center for Mathematics and Interdisciplinary Sciences of the Chinese Academy of Sciences.
Abstract: This paper focuses on the development of an efficient, three-dimensional, thermo-mechanical, nonlinear-Stokes flow computational model for ice sheet simulation. The model is based on the parallel finite element model developed in [14], which features high-order accurate finite element discretizations on variable resolution grids. Here, we add an improved iterative solution method for treating the nonlinearity of the Stokes problem, a new high-order accurate finite element solver for the temperature equation, and a new conservative finite volume solver for handling mass conservation. The result is an accurate and efficient numerical model for thermo-mechanical glacier and ice-sheet simulations. We demonstrate the improved efficiency of the Stokes solver using the ISMIP-HOM benchmark experiments and a realistic test case for the Greenland ice sheet. We also apply our model to the EISMINT-II benchmark experiments and demonstrate stable thermo-mechanical ice sheet evolution on both structured and unstructured meshes. Notably, we find no evidence for the "cold spoke" instabilities observed for these same experiments when using finite difference, shallow-ice approximation models on structured grids.