Funding: Project (61170049) supported by the National Natural Science Foundation of China; Project (2012AA010903) supported by the National High Technology Research and Development Program of China
Abstract: Peta-scale high-performance computing systems are increasingly built with heterogeneous CPU and GPU nodes to achieve higher power efficiency and computation throughput. While providing unprecedented capabilities to conduct computational experiments of historic significance, these systems are presently difficult to program. The users, who are domain experts rather than computer experts, prefer to use programming models closer to their domains (e.g., physics and biology) rather than MPI and OpenMP. This has led to the development of domain-specific programming that provides domain-specific programming interfaces but abstracts away some performance-critical architecture details. Based on experience in designing large-scale computing systems, a hybrid programming framework for scientific computing on heterogeneous architectures is proposed in this work. Its design philosophy is to provide a collaborative mechanism for domain experts and computer experts so that both domain-specific knowledge and performance-critical architecture details can be adequately exploited. Two real-world scientific applications have been evaluated on TH-1A, a peta-scale CPU-GPU heterogeneous system that is currently the 5th fastest supercomputer in the world. The experimental results show that the proposed framework is well suited for developing large-scale scientific computing applications on peta-scale heterogeneous CPU/GPU systems.
Funding: the Department of Defense under Contract No. #H88230-08-C-0202; the Google Research Awards
Abstract: This paper reviews recently developed optical interconnect technologies designed for scalable, low-latency and high-throughput communications within datacenters or high-performance computers. Three typical architectures are discussed in detail: the broadcast-and-select based Optical Shared Memory Supercomputer Interconnect System (OSMOSIS) switch, the deflection-routing based Data Vortex switch, and the arrayed waveguide grating based Low-latency Interconnect Optical Network Switch (LIONS). In particular, we investigate the various loopback buffering technologies in LIONS and present a proof-of-principle testbed demonstration showing the feasibility of the LIONS architecture. Moreover, the performance of LIONS, Data Vortex and OSMOSIS is compared, in terms of throughput and latency, with that of a traditional state-of-the-art electrical switching network based on the Flattened-ButterFly (FBF) architecture. The simulation-based performance study shows that the latency of LIONS is almost independent of the number of input ports and does not saturate even at very high input load.
Funding: Supported by the High Technology Research and Development Programme of China (No. 2006AA01A102, 2009AA01A129) and the National Natural Science Foundation of China (No. 60703020).
Abstract: To save cost, more and more users choose to provision resources at the granularity of virtual machines in cluster systems, especially data centres. Maintaining a consistent member view is the foundation of reliable cluster management, and it also raises several challenging issues for large-scale cluster systems deployed with virtual machines (which we call virtualized clusters). In this paper, we introduce our experience in the design and implementation of scalable member view management on large-scale virtual clusters. Our research contributions cover three aspects: 1) we propose a scalable and reliable management infrastructure that combines a peer-to-peer structure and a hierarchy structure to maintain a consistent member view in virtual clusters; 2) we present a lightweight group membership algorithm that can reach a consistent member view within a single round of message exchange; 3) we design and implement a scalable membership service that can provide virtual machines and maintain a consistent member view in virtual clusters. Our work is verified on Dawning 5000A, which ranked No. 10 among the Top 500 supercomputers in November 2008.
Abstract: Many applications in computational science and engineering require the computation of eigenvalues and eigenvectors of dense symmetric or Hermitian matrices. For example, in DFT (density functional theory) calculations on modern supercomputers, 10% to 30% of the eigenvalues and eigenvectors of huge dense matrices have to be calculated. Therefore, the performance and parallel scaling of the eigensolvers used are of utmost interest. In this article, different routines of the linear algebra packages ScaLAPACK and Elemental for the parallel solution of the symmetric eigenvalue problem are compared with respect to their performance on the BlueGene/P supercomputer. Parameters for performance optimization are adjusted for the different data distribution methods used in the two libraries. It is found that for all test cases the new library Elemental, which uses a two-dimensional element-by-element distribution of the matrices to the processors, shows better performance than the old ScaLAPACK library, which uses a block-cyclic distribution.
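The two data distributions contrasted in this abstract can be illustrated with a small ownership map. The following is a minimal sketch, not either library's API: the function names are hypothetical, and it assumes zero-based global indices, a `pr` x `pc` process grid, and no source-process offset. Under those assumptions, an element-by-element (cyclic) layout is the block-cyclic layout with block size 1.

```python
def block_cyclic_owner(i, j, nb, pr, pc):
    """Grid coordinates (row, col) of the process owning global entry (i, j)
    under a 2D block-cyclic layout with square nb x nb blocks (assumed)."""
    return ((i // nb) % pr, (j // nb) % pc)


def element_cyclic_owner(i, j, pr, pc):
    """Grid coordinates of the owner under an element-by-element cyclic layout."""
    return (i % pr, j % pc)


def ownership_rows(n, owner, pc):
    """Render an n x n matrix as rows of flat process ranks (row-major grid)."""
    rows = []
    for i in range(n):
        ranks = []
        for j in range(n):
            r, c = owner(i, j)
            ranks.append(str(r * pc + c))
        rows.append(" ".join(ranks))
    return rows


if __name__ == "__main__":
    n, pr, pc, nb = 8, 2, 2, 2  # illustrative sizes only
    print("block-cyclic (nb = 2):")
    print("\n".join(ownership_rows(n, lambda i, j: block_cyclic_owner(i, j, nb, pr, pc), pc)))
    print("element-by-element:")
    print("\n".join(ownership_rows(n, lambda i, j: element_cyclic_owner(i, j, pr, pc), pc)))
```

Printing both maps for a small matrix shows that the block-cyclic layout assigns contiguous `nb` x `nb` tiles to each process, while the element-wise layout interleaves ownership at the level of single entries.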
Abstract: At the International Supercomputing Conference held in Frankfurt, Germany on June 20, 2016, TOP500.org published the latest supercomputer rankings. China’s Sunway TaihuLight took pole position. This is the seventh time in a row that China’s supercomputers have topped the Top500 rankings, published biannually since 1993.
Abstract: China’s Supercomputer Helps Construct "Smart Cities": Developers of China’s Tianhe-1A, one of the world’s fastest supercomputers, are tapping into the digital brain’s higher functions, moving it beyond animation and Internet financing to help in the construction of new "smart cities." The Tianhe-1A can digitize the planning, design, construction,
Funding: supported by the National Key Research and Development Program of China (2016YFA0301700); the National Natural Science Foundation of China (11625419); the Anhui Initiative in Quantum Information Technologies (AHY080000); supported by Yangzi Cloud Computing Data Centre and Gyrotech, Nanjing, China
Abstract: Classical simulations of quantum circuits are limited in both space and time when the qubit count is above 50, the realm where quantum supremacy reigns. Recently, however, teams at Google and IBM have proposed several methods for simulating low-depth circuits with more than 50 qubits. Here, we present a simulation scheme that can extract a large number of measurement outcomes within a short time, achieving a 64-qubit simulation of a universal random circuit of depth 22 using a 128-node cluster, and 56- and 42-qubit circuits on a single PC. We also estimate that a 72-qubit circuit of depth 23 can be simulated in about 16 h on a supercomputer identical to that used by the IBM team. Moreover, the simulation processes are exceedingly separable, hence parallelizable, involving just a few inter-process communications. Our work enables simulating more qubits with less hardware burden and provides a new perspective for classical simulations.
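The space limit mentioned in this abstract can be made concrete with a back-of-the-envelope estimate. The sketch below is not the authors' scheme: it assumes the naive approach of storing the full state vector of 2^n amplitudes, each a double-precision complex number of 16 bytes, and the function names are hypothetical.

```python
def statevector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory for a full n-qubit state vector: 2**n amplitudes,
    each assumed to be a 16-byte double-precision complex number."""
    return (2 ** n_qubits) * bytes_per_amplitude


def format_bytes(n):
    """Format a byte count using binary (1024-based) units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB"]
    value = float(n)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


if __name__ == "__main__":
    # Qubit counts taken from the abstract, plus the 50-qubit threshold.
    for n in (42, 50, 56, 64):
        print(f"{n} qubits: {format_bytes(statevector_bytes(n))}")
```

Under these assumptions a 42-qubit state vector alone would already need 64 TiB, which suggests why simulating such circuits on a single PC calls for a scheme that avoids holding the full state in memory.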
Funding: partly supported by the Supercomputer Application Project Trail Funding from Wuxi Jiangnan Institute of Computing Technology (BB2340000016); the Strategic Priority Research Program of Chinese Academy of Sciences (XDC01040100); the National Natural Science Foundation of China (21688102, 21803066); the Anhui Initiative in Quantum Information Technologies (AHY090400); the National Key Research and Development Program of China (2016YFA0200604); the Fundamental Research Funds for Central Universities (WK2340000091); the Chinese Academy of Sciences Pioneer Hundred Talents Program (KJ2340000031); the Research Start-Up Grants (KY2340000094) and the Academic Leading Talents Training Program (KY2340000103) from University of Science and Technology of China.
Abstract: High performance computing (HPC) is a powerful tool to accelerate Kohn–Sham density functional theory (KS-DFT) calculations on modern heterogeneous supercomputers. Here, we describe a massively parallel implementation of the discontinuous Galerkin density functional theory (DGDFT) method on the Sunway TaihuLight supercomputer. The DGDFT method uses adaptive local basis (ALB) functions generated on-the-fly during the self-consistent field (SCF) iteration to solve the KS equations with high precision comparable to that of a plane-wave basis set. In particular, the DGDFT method adopts a two-level parallelization strategy that deals with various types of data distribution, task scheduling, and data communication schemes, and combines it with the master–slave multi-thread heterogeneous parallelism of the SW26010 processor, resulting in large-scale HPC KS-DFT calculations on the Sunway TaihuLight supercomputer. We show that the DGDFT method can scale up to 8,519,680 processing cores (131,072 core groups) on the Sunway TaihuLight supercomputer for studying the electronic structures of two-dimensional (2D) metallic graphene systems that contain tens of thousands of carbon atoms.
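The two scale figures quoted in this abstract are consistent with the SW26010 layout, which is background knowledge not stated in the abstract itself: each processor has four core groups, and each core group has one management core plus 64 computing cores. A minimal sketch of the arithmetic, with hypothetical constant and function names:

```python
# Assumed SW26010 layout (background knowledge, not stated in the abstract).
CORE_GROUPS_PER_PROCESSOR = 4
MANAGEMENT_CORES_PER_GROUP = 1
COMPUTING_CORES_PER_GROUP = 64


def total_cores(core_groups):
    """Processing cores contained in the given number of core groups."""
    per_group = MANAGEMENT_CORES_PER_GROUP + COMPUTING_CORES_PER_GROUP
    return core_groups * per_group


def processors_needed(core_groups):
    """Number of SW26010 processors that supply the given core groups."""
    return core_groups // CORE_GROUPS_PER_PROCESSOR


if __name__ == "__main__":
    groups = 131_072  # core-group count quoted in the abstract
    print(f"{total_cores(groups):,} cores on {processors_needed(groups):,} processors")
```

With 65 cores per core group, 131,072 core groups give exactly the 8,519,680 cores reported.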
Funding: supported by the Guangdong Innovative and Entrepreneurial Research Team Program (2016ZT06D348); Natural Science Foundation of Guangdong Province (2017B030308003); the Key R&D Program of Guangdong Province (2018B030326001); the Science, Technology and Innovation Commission of Shenzhen Municipality (JCYJ20170412152620376, JCYJ20170817105046702 and KYTDPT20181011104202253); the National Natural Science Foundation of China (11875160 and U1801661); supported by the National Natural Science Foundation of China (61832003, 61872334); the Economy, Trade and Information Commission of Shenzhen Municipality (201901161512); the Strategic Priority Research Program of Chinese Academy of Sciences (XDB28000000); K. C. Wong Education Foundation
Abstract: Gaussian boson sampling is an alternative model for demonstrating quantum computational supremacy, where squeezed states are injected into every input mode, instead of applying single photons as in the case of standard boson sampling. Here, by numerically analyzing the computational costs, we establish a lower bound for achieving quantum computational supremacy for a class of Gaussian boson sampling problems. Specifically, we propose a more efficient method for calculating the transition probabilities, leading to a significant reduction of the simulation costs. In particular, our numerical results indicate that one can simulate up to 18 photons for Gaussian boson sampling at the output subspace on a normal laptop, 20 photons on a commercial workstation with 256 cores, and about 30 photons on supercomputers. These numbers are significantly smaller than those in standard boson sampling, suggesting that Gaussian boson sampling could be experimentally friendly for demonstrating quantum computational supremacy.