Funding: Project 61170049 supported by the National Natural Science Foundation of China; Project 2012AA010903 supported by the National High Technology Research and Development Program of China.
Abstract: Peta-scale high-performance computing systems are increasingly built with heterogeneous CPU and GPU nodes to achieve higher power efficiency and computation throughput. While providing unprecedented capabilities to conduct computational experiments of historic significance, these systems are presently difficult to program. The users, who are domain experts rather than computer experts, prefer programming models closer to their domains (e.g., physics and biology) rather than MPI and OpenMP. This has led to the development of domain-specific programming frameworks that provide domain-specific programming interfaces but abstract away some performance-critical architecture details. Based on experience in designing large-scale computing systems, a hybrid programming framework for scientific computing on heterogeneous architectures is proposed in this work. Its design philosophy is to provide a collaborative mechanism for domain experts and computer experts so that both domain-specific knowledge and performance-critical architecture details can be adequately exploited. Two real-world scientific applications have been evaluated on TH-1A, a peta-scale CPU-GPU heterogeneous system that is currently the 5th fastest supercomputer in the world. The experimental results show that the proposed framework is well suited for developing large-scale scientific computing applications on peta-scale heterogeneous CPU/GPU systems.
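To make the division of labor concrete, here is a hypothetical sketch in Python (the class and function names are illustrative, not the paper's actual API): domain experts program against a domain-level interface, while computer experts supply the architecture-specific kernels behind it.

```python
# A hypothetical sketch of the collaborative mechanism the abstract describes
# (names are illustrative; this is not the paper's actual framework API).
import numpy as np

# -- computer expert's side: architecture-specific kernels ------------------
def axpy_cpu(a, x, y):
    return a * x + y           # stand-in for a tuned CPU (e.g. OpenMP) kernel

def axpy_gpu(a, x, y):
    # placeholder: a real backend would dispatch to a CUDA kernel here
    return a * x + y

BACKENDS = {"cpu": axpy_cpu, "gpu": axpy_gpu}

# -- domain expert's side: no architecture details visible ------------------
class Field:
    """A domain-level array type; the backend choice stays hidden."""
    def __init__(self, data, backend="cpu"):
        self.data = np.asarray(data, dtype=float)
        self.kernel = BACKENDS[backend]

    def saxpy(self, a, other):
        """Compute a * self + other without exposing the kernel used."""
        return Field(self.kernel(a, self.data, other.data))

u = Field([1.0, 2.0, 3.0])
v = Field([4.0, 5.0, 6.0])
print(u.saxpy(2.0, v).data)    # [ 6.  9. 12.]
```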
Abstract: Many applications in computational science and engineering require the computation of eigenvalues and eigenvectors of dense symmetric or Hermitian matrices. For example, in DFT (density functional theory) calculations on modern supercomputers, 10% to 30% of the eigenvalues and eigenvectors of huge dense matrices have to be calculated. Therefore, the performance and parallel scaling of the eigensolvers used are of utmost interest. In this article, different routines of the linear algebra packages ScaLAPACK and Elemental for the parallel solution of the symmetric eigenvalue problem are compared with respect to their performance on the BlueGene/P supercomputer. Parameters for performance optimization are adjusted for the different data distribution methods used in the two libraries. It is found that for all test cases the newer library Elemental, which uses a two-dimensional element-by-element distribution of the matrices to the processors, shows better performance than the older ScaLAPACK library, which uses a block-cyclic distribution.
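The two data layouts the abstract contrasts can be illustrated by the ownership rule each applies to a matrix entry. The following sketch (plain Python, not the libraries' actual APIs) maps entry (i, j) to a process on a pr × pc grid under ScaLAPACK's 2D block-cyclic distribution and under Elemental's element-by-element distribution, which is simply the block-cyclic layout with block size 1.

```python
# A minimal sketch contrasting the two data distributions compared in the
# article: ScaLAPACK's 2D block-cyclic layout vs. Elemental's
# element-by-element cyclic layout (block-cyclic with block size 1).
import numpy as np

def block_cyclic_owner(i, j, nb, pr, pc):
    """Process (row, col) owning entry (i, j) under a 2D block-cyclic
    distribution with block size nb on a pr x pc process grid."""
    return ((i // nb) % pr, (j // nb) % pc)

def elemental_owner(i, j, pr, pc):
    """Element-by-element distribution: entries are dealt to processes
    one at a time, like playing cards."""
    return (i % pr, j % pc)

if __name__ == "__main__":
    n, nb, pr, pc = 8, 2, 2, 2
    for name, owner in [
        ("block-cyclic (nb=2)", lambda i, j: block_cyclic_owner(i, j, nb, pr, pc)),
        ("element-by-element", lambda i, j: elemental_owner(i, j, pr, pc)),
    ]:
        # Print which process rank owns each entry of an n x n matrix.
        grid = np.array([[owner(i, j)[0] * pc + owner(i, j)[1]
                          for j in range(n)] for i in range(n)])
        print(name)
        print(grid)
```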
Funding: Supported by the National Key Research and Development Program of China (2016YFA0301700), the National Natural Science Foundation of China (11625419), and the Anhui Initiative in Quantum Information Technologies (AHY080000); also supported by Yangzi Cloud Computing Data Centre and Gyrotech, Nanjing, China.
Abstract: Classical simulations of quantum circuits are limited in both space and time when the qubit count rises above 50, the realm where quantum supremacy reigns. Recently, however, several simulation methods have been proposed by teams at Google and IBM for low-depth circuits with more than 50 qubits. Here, we present a simulation scheme that can extract a large number of measurement outcomes within a short time, achieving a 64-qubit simulation of a universal random circuit of depth 22 using a 128-node cluster, and 56- and 42-qubit circuits on a single PC. We also estimate that a 72-qubit circuit of depth 23 can be simulated in about 16 h on a supercomputer identical to that used by the IBM team. Moreover, the simulation processes are exceedingly separable, hence parallelizable, involving only a few inter-process communications. Our work enables simulating more qubits with less hardware burden and provides a new perspective on classical simulations.
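For reference, the brute-force state-vector simulation that the partitioned scheme improves upon can be sketched in a few lines of Python. The sketch below is illustrative only, not the authors' method: it applies random single-qubit gates and CZ gates to an n-qubit register and shows why memory scales as 2^n amplitudes, making 50+ qubits infeasible without the kind of circuit separation the abstract describes.

```python
# A minimal state-vector simulator sketch (illustrative only; the paper's
# actual scheme partitions the circuit and merges branch amplitudes).
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard gate
T = np.diag([1.0, np.exp(1j * np.pi / 4)])     # T gate

def apply_1q(state, gate, q, n):
    """Apply a single-qubit gate to qubit q of an n-qubit state vector."""
    psi = state.reshape([2] * n)
    psi = np.tensordot(gate, psi, axes=([1], [q]))
    return np.moveaxis(psi, 0, q).reshape(-1)

def apply_cz(state, q0, q1, n):
    """Controlled-Z: flip the sign of amplitudes where both qubits are 1."""
    psi = state.reshape([2] * n)
    idx = [slice(None)] * n
    idx[q0], idx[q1] = 1, 1
    psi[tuple(idx)] *= -1
    return psi.reshape(-1)

n = 10                          # 2**10 amplitudes; 2**50+ is infeasible this way
state = np.zeros(2**n, dtype=complex)
state[0] = 1.0                  # start in |0...0>
rng = np.random.default_rng(0)
for layer in range(22):         # depth comparable to the circuits in the abstract
    for q in range(n):
        state = apply_1q(state, H if rng.random() < 0.5 else T, q, n)
    q = rng.integers(0, n - 1)
    state = apply_cz(state, q, q + 1, n)
probs = np.abs(state) ** 2      # distribution over measurement outcomes
print(probs.sum())              # ~1.0, sanity check: unitarity preserves norm
```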