The China dual-functional lithium–lead test blanket module(DFLL-TBM) is a liquid Li Pb blanket concept developed by the Institute of Nuclear Energy Safety Technology of the Chinese Academy of Sciences for testing in ...The China dual-functional lithium–lead test blanket module(DFLL-TBM) is a liquid Li Pb blanket concept developed by the Institute of Nuclear Energy Safety Technology of the Chinese Academy of Sciences for testing in ITER to validate relevant tritium breeding and shielding technologies. In this study, neutronic calculations of DFLL-TBM were carried out using a massively parallel three-dimensional transport code, Hydra, with the Fusion Evaluated Nuclear Data Library/MG. Hydra was developed by the Nuclear Engineering Computational Physics Lab based on the discrete ordinates method and has been devoted to neutronic analysis and shielding evaluation for nuclear facilities. An in-house Monte Carlo code(MCX) was employed to verify the discretized calculation model used by Hydra for the DFLL-TBM calculations. The results showed two key aspects:(1) In most material zones,Hydra solutions are in good agreement with the reference MCX results within 1%, and the maximal relative difference of the neutron flux is merely 3%, demonstrating the correctness of the calculation model;(2) while the current DFLL-TBM design meets the operation shielding requirement of ITER for 4 years, it does not satisfy the tritium self-sufficiency requirement. Compared to the two-step approach, Hydra produces higher accuracies as it does not rely on the homogenization technique during the calculation process. The parallel efficiency tests of Hydra using the DFLL-TBM model also showed that this code maintains a high parallel efficiency on O(100) processors and, as a result, is able to significantly improve computing performance through parallelization. Parameter studies have been carried out by varying the thickness of the beryllium armor layer and the tritium breeding zone to understand the influence of the beryllium layer and breeding zone thickness on tritium breeding performance. This establishes a foundation for further improvement in the tritium production performance of DFLL-TBM.展开更多
A novel framework for parallel subgraph isomorphism on GPUs is proposed, named GPUSI, which consists of GPU region exploration and GPU subgraph matching. The GPUSI iteratively enumerates subgraph instances and solves ...A novel framework for parallel subgraph isomorphism on GPUs is proposed, named GPUSI, which consists of GPU region exploration and GPU subgraph matching. The GPUSI iteratively enumerates subgraph instances and solves the subgraph isomorphism in a divide-and-conquer fashion. The framework completely relies on the graph traversal, and avoids the explicit join operation. Moreover, in order to improve its performance, a task-queue based method and the virtual-CSR graph structure are used to balance the workload among warps, and warp-centric programming model is used to balance the workload among threads in a warp. The prototype of GPUSI is implemented, and comprehensive experiments of various graph isomorphism operations are carried on diverse large graphs. The experiments clearly demonstrate that GPUSI has good scalability and can achieve speed-up of 1.4–2.6 compared to the state-of-the-art solutions.展开更多
The key to large-scale parallel solutions of deterministic particle transport problem is single-node computation performance. Hence, single-node computation is often parallelized on multi-core or many-core computer ar...The key to large-scale parallel solutions of deterministic particle transport problem is single-node computation performance. Hence, single-node computation is often parallelized on multi-core or many-core computer architectures. However, the number of on-chip cores grows quickly with the scale-down of feature size in semiconductor technology. In this paper, we present a scalability investigation of one energy group time-independent deterministic discrete ordinates neutron transport in 3D Cartesian geometry(Sweep3D) on Intel's Many Integrated Core(MIC) architecture, which can provide up to 62 cores with four hardware threads per core now and will own up to 72 in the future. The parallel programming model, Open MP, and vector intrinsic functions are used to exploit thread parallelism and vector parallelism for the discrete ordinates method, respectively. The results on a 57-core MIC coprocessor show that the implementation of Sweep3 D on MIC has good scalability in performance. In addition, the application of the Roofline model to assess the implementation and performance comparison between MIC and Tesla K20 C Graphics Processing Unit(GPU) are also reported.展开更多
A data-driven method was proposed to realistically animate garments on human poses in reduced space. Firstly, a gradient based method was extended to generate motion sequences and garments were simulated on the sequen...A data-driven method was proposed to realistically animate garments on human poses in reduced space. Firstly, a gradient based method was extended to generate motion sequences and garments were simulated on the sequences as our training data. Based on the examples, the proposed method can fast output realistic garments on new poses. Our framework can be mainly divided into offline phase and online phase. During the offline phase, based on linear blend skinning(LBS), rigid bones and flex bones were estimated for human bodies and garments, respectively. Then, rigid bone weight maps on garment vertices were learned from examples. In the online phase, new human poses were treated as input to estimate rigid bone transformations. Then, both rigid bones and flex bones were used to drive garments to fit the new poses. Finally, a novel formulation was also proposed to efficiently deal with garment-body penetration. Experiments manifest that our method is fast and accurate. The intersection artifacts are fast removed and final garment results are quite realistic.展开更多
Breadth-first search(BFS) is an important kernel for graph traversal and has been used by many graph processing applications. Extensive studies have been devoted in boosting the performance of BFS. As the most effecti...Breadth-first search(BFS) is an important kernel for graph traversal and has been used by many graph processing applications. Extensive studies have been devoted in boosting the performance of BFS. As the most effective solution, GPU-acceleration achieves the state-of-the-art result of 3.3×109 traversed edges per second on a NVIDIA Tesla C2050 GPU. A novel vertex frontier based GPU BFS algorithm is proposed, and its main features are three-fold. Firstly, to obtain a better workload balance for irregular graphs, a virtual-queue task decomposition and mapping strategy is introduced for vertex frontier expanding. Secondly, a global deduplicate detection scheme is proposed to remove reduplicative vertices from vertex frontier effectively. Finally, a GPU-based bottom-up BFS approach is employed to process large frontier. The experimental results demonstrate that the algorithm can achieve 10% improvement over the state-of-the-art method on diverse graphs. Especially, it exhibits 2-3 times speedup on low-diameter and scale-free graphs over the state-of-the-art on a NVIDIA Tesla K20 c GPU, reaching a peak traversal rate of 11.2×109 edges/s.展开更多
The contribution of parasitic bipolar amplification to SETs is experimentally verified using two P-hit target chains in the normal layout and in the special layout. For PMOSs in the normal layout, the single-event cha...The contribution of parasitic bipolar amplification to SETs is experimentally verified using two P-hit target chains in the normal layout and in the special layout. For PMOSs in the normal layout, the single-event charge collection is composed of diffusion, drift, and the parasitic bipolar effect, while for PMOSs in the special layout, the parasitic bipolar junction transistor cannot turn on. Heavy ion experimental results show that PMOSs without parasitic bipolar amplification have a 21.4% decrease in the average SET pulse width and roughly a 40.2% reduction in the SET cross-section.展开更多
It is widely believed that Shor's factoring algorithm provides a driving force to boost the quantum computing research.However, a serious obstacle to its binary implementation is the large number of quantum gates. No...It is widely believed that Shor's factoring algorithm provides a driving force to boost the quantum computing research.However, a serious obstacle to its binary implementation is the large number of quantum gates. Non-binary quantum computing is an efficient way to reduce the required number of elemental gates. Here, we propose optimization schemes for Shor's algorithm implementation and take a ternary version for factorizing 21 as an example. The optimized factorization is achieved by a two-qutrit quantum circuit, which consists of only two single qutrit gates and one ternary controlled-NOT gate. This two-qutrit quantum circuit is then encoded into the nine lower vibrational states of an ion trapped in a weakly anharmonic potential. Optimal control theory(OCT) is employed to derive the manipulation electric field for transferring the encoded states. The ternary Shor's algorithm can be implemented in one single step. Numerical simulation results show that the accuracy of the state transformations is about 0.9919.展开更多
Magnetotelluric(MT)inversion is an illposed problem and the standard way to address it is through regularization,by adding a stabilizing functional to the data objective functional in order to obtain a stable solution...Magnetotelluric(MT)inversion is an illposed problem and the standard way to address it is through regularization,by adding a stabilizing functional to the data objective functional in order to obtain a stable solution.The traditional stabilizing functionals,in which a low-order differential operator is used,yield a smooth solution that may not be appropriate when anomalies occur in block patterns.In some cases the focused imaging of a sharp electrical boundary is necessary.Even though various experiments have used stabilizing functionals that are suitable to obtain a clear and sharp boundary,such as the minimum support(MS)and the minimum gradient support(MGS)functionals,there are still some limitations in practice.In this paper,the minimum support gradient(MSG)is proposed as the stabilizing functional.Under the uniform regularization framework,a regularized inversion with a variety of stabilizing functionals is performed and the inversion results are compared.This study shows that MSG inversion can not only obtain a clearly focused inversion but also a quite stable and robust one.展开更多
As the big data era is coming, it brings new challenges to the massive data processing. A combination of GPU and CPU on chip is the trend to release the pressure of large scale computing. We found that there are diffe...As the big data era is coming, it brings new challenges to the massive data processing. A combination of GPU and CPU on chip is the trend to release the pressure of large scale computing. We found that there are different memory access characteristics between GPU and CPU. The most important one is that the programs of GPU include a large number of threads, which lead to higher access frequency in cache than the CPU programs. Although the LRU policy favors the programs with high memory access frequency, the programs of GPU can’t get the corresponding performance boost even more cache resources are provided. So LRU policy is not suitable for heterogeneous multi-core processor. Based on the different characteristics of GPU and CPU programs on memory access, this paper proposes an LLC dynamic replacement policy--DIPP (Dynamic Insertion / Promotion Policy) for heterogeneous multi-core processors.The core idea of the replacement policy is to reduce the miss rate of the program and enhance the overall system performance by limiting the cache resources that GPU can acquire and reducing the thread interferences between programs. Experiments compare the DIPP replacement policy with LRU and we conduct a classified discussion according to the program results of GPU. Friendly programs enhance 23.29% on the average performance (using arithmetic mean).Large working sets programs can improve 13.95%, compute-intensive programs enhance 9.66% and stream class programs improve 3.8%.展开更多
On June 17, 2013, MilkyWay-2 (Tianhe-2) supercomputer was crowned as the fastest supercomputer in the world on the 41th TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design...On June 17, 2013, MilkyWay-2 (Tianhe-2) supercomputer was crowned as the fastest supercomputer in the world on the 41th TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design of hardware and software systems. The key architecture features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes integrating commodity- off-the-shelf processors and accelerators that share similar instruction set architecture, powerful networks that employ proprietary interconnection chips to support the massively parallel message-passing communications, proprietary 16- core processor designed for scientific computing, efficient software stacks that provide high performance file system, emerging programming model for heterogeneous systems, and intelligent system administration. We perform extensive evaluation with wide-ranging applications from LINPACK and Graph500 benchmarks to massively parallel software deployed in the system.展开更多
Recently, sequence anomaly detection has been widely used in many fields. Sequence data in these fields are usually multi-dimensional over the data stream. It is a challenge to design an anomaly detection method for a...Recently, sequence anomaly detection has been widely used in many fields. Sequence data in these fields are usually multi-dimensional over the data stream. It is a challenge to design an anomaly detection method for a multi-dimensional sequence over the data stream to satisfy the requirements of accuracy and high speed. It is because:(1) Redundant dimensions in sequence data and large state space lead to a poor ability for sequence modeling;(2) Anomaly detection cannot adapt to the high-speed nature of the data stream, especially when concept drift occurs, and it will reduce the detection rate. On one hand, most existing methods of sequence anomaly detection focus on the single-dimension sequence. On the other hand, some studies concerning multi-dimensional sequence concentrate mainly on the static database rather than the data stream. To improve the performance of anomaly detection for a multi-dimensional sequence over the data stream, we propose a novel unsupervised fast and accurate anomaly detection(FAAD) method which includes three algorithms. First, a method called "information calculation and minimum spanning tree cluster" is adopted to reduce redundant dimensions. Second, to speed up model construction and ensure the detection rate for the sequence over the data stream, we propose a method called"random sampling and subsequence partitioning based on the index probabilistic suffix tree." Last, the method called "anomaly buffer based on model dynamic adjustment" dramatically reduces the effects of concept drift in the data stream. FAAD is implemented on the streaming platform Storm to detect multi-dimensional log audit data.Compared with the existing anomaly detection methods, FAAD has a good performance in detection rate and speed without being affected by concept drift.展开更多
Performance and energy consumption of high performance computing (HPC) interconnection networks have a great significance in the whole supercomputer, and building up HPC interconnection network simulation plat- form...Performance and energy consumption of high performance computing (HPC) interconnection networks have a great significance in the whole supercomputer, and building up HPC interconnection network simulation plat- form is very important for the research on HPC software and hardware technologies. To effectively evaluate the per- formance and energy consumption of HPC interconnection networks, this article designs and implements a detailed and clock-driven HPC interconnection network simulation plat- form, called HPC-NetSim. HPC-NetSim uses application- driven workloads and inherits the characteristics of the de- tailed and flexible cycle-accurate network simulator. Besides, it offers a large set of configurable network parameters in terms of topology and routing, and supports router's on/off states. We compare the simulated execution time with the real execution time of Tianhe-2 subsystem and the mean error is only 2.7%. In addition, we simulate the network behaviors with different network structures and low-power modes. The results are also consistent with the theoretical analyses.展开更多
Determinism is very useful to multithreaded programs in debugging, testing, etc. Many deterministic ap- proaches have been proposed, such as deterministic multithreading (DMT) and deterministic replay. However, thes...Determinism is very useful to multithreaded programs in debugging, testing, etc. Many deterministic ap- proaches have been proposed, such as deterministic multithreading (DMT) and deterministic replay. However, these sys- tems either are inefficient or target a single purpose, which is not flexible. In this paper, we propose an efficient and flexible deterministic framework for multithreaded programs. Our framework implements determinism in two steps: relaxed determinism and strong determinism. Relaxed determinism solves data races eificiently by using a proper weak memory consistency model. After that, we implement strong determinism by solving lock contentions deterministically. Since we can apply different approaches for these two steps independently, our framework provides a spectrum of deterministic choices, including nondeterministic system (fast), weak deterministic system (fast and conditionally deterministic), DMT system, and deternfinistic replay system. Our evaluation shows that the DMT configuration of this framework could even outperform a state-of-the-art DMT system.展开更多
In this paper, we present the Tianhe-2 interconnect network and message passing services. We describe the architecture of the router and network interface chips, and highlight a set of hardware and software features e...In this paper, we present the Tianhe-2 interconnect network and message passing services. We describe the architecture of the router and network interface chips, and highlight a set of hardware and software features effectively supporting high performance communications, ranging over remote direct memory access, collective optimization, hardwareenable reliable end-to-end communication, user-level message passing services, etc. Measured hardware performance results are also presented.展开更多
Interconnection network plays an important role in scalable high performance computer (HPC) systems. The TH Express-2 interconnect has been used in MilkyWay-2 system to provide high-bandwidth and low-latency interpr...Interconnection network plays an important role in scalable high performance computer (HPC) systems. The TH Express-2 interconnect has been used in MilkyWay-2 system to provide high-bandwidth and low-latency interprocessot communications, and continuous efforts are devoted to the development of our proprietary interconnect. This paper describes the state-of-the-art of our proprietary interconnect, especially emphasizing on the design of network interface. Several key features are introduced, such as user-level communication, remote direct memory access, offload collective operation, and hardware reliable end-to-end communication, etc. The design of a low level message passing infrastructures and an upper message passing services are also proposed. The preliminary performance results demonstrate the efficiency of the TH interconnect interface.展开更多
With the rapid increase of the size of applications and the complexity of the supercomputer architecture,topology-aware process mapping becomes increasingly important.High communication cost has become a dominant cons...With the rapid increase of the size of applications and the complexity of the supercomputer architecture,topology-aware process mapping becomes increasingly important.High communication cost has become a dominant constraint of the performance of applications running on the supercomputer.To avoid a bad mapping strategy which can lead to terrible communication performance,we propose an optimized heuristic topology-aware mapping algorithm(OHTMA).The algorithm attempts to minimize the hop-byte metric that we use to measure the mapping results.OHTMA incorporates a new greedy heuristic method and pair-exchange-based optimization.It reduces the number of long-distance communications and effectively enhances the locality of the communication.Experimental results on the Tianhe-3 exascale supercomputer prototype indicate that OHTMA can significantly reduce the communication costs.展开更多
With the increase of system scale, the inherent reliability of supercomputers becomes lower and lower. The cost of fault handling and task recovery increases so rapidly that the reliability issue will soon harm the us...With the increase of system scale, the inherent reliability of supercomputers becomes lower and lower. The cost of fault handling and task recovery increases so rapidly that the reliability issue will soon harm the usability of supercomputers. This issue is referred to as the "reliability wall", which is regarded as a critical problem for current and future supercomputers. To address this problem, we propose an autonomous fault-tolerant system, named Iaso, in MilkyWay- 2 system. Iaso introduces the concept of autonomous management in supercomputers. By autonomous management, the computer itself, rather than manpower, takes charge of the fault management work. Iaso automatically manage the whole lifecycle of faults, including fault detection, fault diagnosis, fault isolation, and task recovery. Iaso endows the autonomous features with MilkyWay-2 system, such as self-awareness, self-diagnosis, self-healing, and self-protection. With the help of Iaso, the cost of fault handling in supercomputers reduces from several hours to a few seconds. Iaso greatly improves the usability and reliability of MilkyWay-2 system.展开更多
Obtaining training material for rarely used English words and common given names from countries where English is not spoken is difficult due to excessive time, storage and cost factors. By considering personal privacy...Obtaining training material for rarely used English words and common given names from countries where English is not spoken is difficult due to excessive time, storage and cost factors. By considering personal privacy, language- independent (LI) with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a convenient option to solve tile problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR for real-time applications with limited storage and small vocabularies. These applications include voice dialing on mobile devices, menu-driven recognition, and voice control on vehicles and robotics. However, traditional DTW has several lhnitations, such as high computational complexity, constraint induced coarse approximation, and inaccuracy problems. In this paper, we introduce the merge-weighted dynamic time warping (MWDTW) algorithm. This method defines a template confidence index for measuring the similarity between merged training data and testing data, while following the core DTW process. MWDTW is simple, efficient, and easy to implement. With extensive experiments on three representative SD speech recognition datasets, we demonstrate that our method outperforms DTW, DTW on merged speech data, the hidden Markov model (HMM) significantly, and is also six times faster than DTW overall.展开更多
Deep neural networks(DNNs)have recently shown great potential in solving partial differential equations(PDEs).The success of neural network-based surrogate models is attributed to their ability to learn a rich set of ...Deep neural networks(DNNs)have recently shown great potential in solving partial differential equations(PDEs).The success of neural network-based surrogate models is attributed to their ability to learn a rich set of solution-related features.However,learning DNNs usually involves tedious training iterations to converge and requires a very large number of training data,which hinders the application of these models to complex physical contexts.To address this problem,we propose to apply the transfer learning approach to DNN-based PDE solving tasks.In our work,we create pairs of transfer experiments on Helmholtz and Navier-Stokes equations by constructing subtasks with different source terms and Reynolds numbers.We also conduct a series of experiments to investigate the degree of generality of the features between different equations.Our results demonstrate that despite differences in underlying PDE systems,the transfer methodology can lead to a significant improvement in the accuracy of the predicted solutions and achieve a maximum performance boost of 97.3%on widely used surrogate models.展开更多
基金the National Key Research and Development Program of China(Nos.2018YFB0204301,2017YFB0202104,and 2017YFE0302200)。
文摘The China dual-functional lithium–lead test blanket module(DFLL-TBM) is a liquid Li Pb blanket concept developed by the Institute of Nuclear Energy Safety Technology of the Chinese Academy of Sciences for testing in ITER to validate relevant tritium breeding and shielding technologies. In this study, neutronic calculations of DFLL-TBM were carried out using a massively parallel three-dimensional transport code, Hydra, with the Fusion Evaluated Nuclear Data Library/MG. Hydra was developed by the Nuclear Engineering Computational Physics Lab based on the discrete ordinates method and has been devoted to neutronic analysis and shielding evaluation for nuclear facilities. An in-house Monte Carlo code(MCX) was employed to verify the discretized calculation model used by Hydra for the DFLL-TBM calculations. The results showed two key aspects:(1) In most material zones,Hydra solutions are in good agreement with the reference MCX results within 1%, and the maximal relative difference of the neutron flux is merely 3%, demonstrating the correctness of the calculation model;(2) while the current DFLL-TBM design meets the operation shielding requirement of ITER for 4 years, it does not satisfy the tritium self-sufficiency requirement. Compared to the two-step approach, Hydra produces higher accuracies as it does not rely on the homogenization technique during the calculation process. The parallel efficiency tests of Hydra using the DFLL-TBM model also showed that this code maintains a high parallel efficiency on O(100) processors and, as a result, is able to significantly improve computing performance through parallelization. Parameter studies have been carried out by varying the thickness of the beryllium armor layer and the tritium breeding zone to understand the influence of the beryllium layer and breeding zone thickness on tritium breeding performance. This establishes a foundation for further improvement in the tritium production performance of DFLL-TBM.
基金Projects(61272142,61103082,61003075,61170261,61103193)supported by the National Natural Science Foundation of ChinaProject supported by Funds for New Century Excellent Talents in University of ChinaProjects(2012AA01A301,2012AA010901)supported by the National High Technology Research and Development Program of China
文摘A novel framework for parallel subgraph isomorphism on GPUs is proposed, named GPUSI, which consists of GPU region exploration and GPU subgraph matching. The GPUSI iteratively enumerates subgraph instances and solves the subgraph isomorphism in a divide-and-conquer fashion. The framework completely relies on the graph traversal, and avoids the explicit join operation. Moreover, in order to improve its performance, a task-queue based method and the virtual-CSR graph structure are used to balance the workload among warps, and warp-centric programming model is used to balance the workload among threads in a warp. The prototype of GPUSI is implemented, and comprehensive experiments of various graph isomorphism operations are carried on diverse large graphs. The experiments clearly demonstrate that GPUSI has good scalability and can achieve speed-up of 1.4–2.6 compared to the state-of-the-art solutions.
基金Supported by National Natural Science Foundation of China(Nos.61402039,61170083,60970033,61373032 and 91430218)National High Technology Research and Development Program of China(No.2012AA01A301)+1 种基金China Postdoctoral Science Foundation(No.2014M562570)National Key Basic Research Program of China(No.61312701001)
文摘The key to large-scale parallel solutions of deterministic particle transport problem is single-node computation performance. Hence, single-node computation is often parallelized on multi-core or many-core computer architectures. However, the number of on-chip cores grows quickly with the scale-down of feature size in semiconductor technology. In this paper, we present a scalability investigation of one energy group time-independent deterministic discrete ordinates neutron transport in 3D Cartesian geometry(Sweep3D) on Intel's Many Integrated Core(MIC) architecture, which can provide up to 62 cores with four hardware threads per core now and will own up to 72 in the future. The parallel programming model, Open MP, and vector intrinsic functions are used to exploit thread parallelism and vector parallelism for the discrete ordinates method, respectively. The results on a 57-core MIC coprocessor show that the implementation of Sweep3 D on MIC has good scalability in performance. In addition, the application of the Roofline model to assess the implementation and performance comparison between MIC and Tesla K20 C Graphics Processing Unit(GPU) are also reported.
基金Project(20104307110003)supported by the Research Fund for the Doctoral Program of Higher Education of ChinaProjects(61379103,61202333,61303185)supported by the National Natural Science Foundation of China+1 种基金Project(2012M520392)supported by the China Postdoctoral Science FoundationProject(CX2012B027)supported by the Hunan Province Graduate Student Innovation Program,China
文摘A data-driven method was proposed to realistically animate garments on human poses in reduced space. Firstly, a gradient based method was extended to generate motion sequences and garments were simulated on the sequences as our training data. Based on the examples, the proposed method can fast output realistic garments on new poses. Our framework can be mainly divided into offline phase and online phase. During the offline phase, based on linear blend skinning(LBS), rigid bones and flex bones were estimated for human bodies and garments, respectively. Then, rigid bone weight maps on garment vertices were learned from examples. In the online phase, new human poses were treated as input to estimate rigid bone transformations. Then, both rigid bones and flex bones were used to drive garments to fit the new poses. Finally, a novel formulation was also proposed to efficiently deal with garment-body penetration. Experiments manifest that our method is fast and accurate. The intersection artifacts are fast removed and final garment results are quite realistic.
基金Projects(61272142,61103082,61003075,61170261,61103193)supported by the National Natural Science Foundation of ChinaProject supported by the Program for New Century Excellent Talents in University of ChinaProjects(2012AA01A301,2012AA010901)supported by the National High Technology Research and Development Program of China
文摘Breadth-first search(BFS) is an important kernel for graph traversal and has been used by many graph processing applications. Extensive studies have been devoted in boosting the performance of BFS. As the most effective solution, GPU-acceleration achieves the state-of-the-art result of 3.3×109 traversed edges per second on a NVIDIA Tesla C2050 GPU. A novel vertex frontier based GPU BFS algorithm is proposed, and its main features are three-fold. Firstly, to obtain a better workload balance for irregular graphs, a virtual-queue task decomposition and mapping strategy is introduced for vertex frontier expanding. Secondly, a global deduplicate detection scheme is proposed to remove reduplicative vertices from vertex frontier effectively. Finally, a GPU-based bottom-up BFS approach is employed to process large frontier. The experimental results demonstrate that the algorithm can achieve 10% improvement over the state-of-the-art method on diverse graphs. Especially, it exhibits 2-3 times speedup on low-diameter and scale-free graphs over the state-of-the-art on a NVIDIA Tesla K20 c GPU, reaching a peak traversal rate of 11.2×109 edges/s.
基金supported by the National Natural Science Foundation of China(Grant No.61376109)
文摘The contribution of parasitic bipolar amplification to SETs is experimentally verified using two P-hit target chains in the normal layout and in the special layout. For PMOSs in the normal layout, the single-event charge collection is composed of diffusion, drift, and the parasitic bipolar effect, while for PMOSs in the special layout, the parasitic bipolar junction transistor cannot turn on. Heavy ion experimental results show that PMOSs without parasitic bipolar amplification have a 21.4% decrease in the average SET pulse width and roughly a 40.2% reduction in the SET cross-section.
基金supported by the National Natural Science Foundation of China(Grant No.61205108)the High Performance Computing(HPC)Foundation of National University of Defense Technology,China
文摘It is widely believed that Shor's factoring algorithm provides a driving force to boost the quantum computing research.However, a serious obstacle to its binary implementation is the large number of quantum gates. Non-binary quantum computing is an efficient way to reduce the required number of elemental gates. Here, we propose optimization schemes for Shor's algorithm implementation and take a ternary version for factorizing 21 as an example. The optimized factorization is achieved by a two-qutrit quantum circuit, which consists of only two single qutrit gates and one ternary controlled-NOT gate. This two-qutrit quantum circuit is then encoded into the nine lower vibrational states of an ion trapped in a weakly anharmonic potential. Optimal control theory(OCT) is employed to derive the manipulation electric field for transferring the encoded states. The ternary Shor's algorithm can be implemented in one single step. Numerical simulation results show that the accuracy of the state transformations is about 0.9919.
基金the National Natural Science Foundation of China(No.41630317)the National Key Research and Development Program of China(No.2017YFC0602405).
文摘Magnetotelluric(MT)inversion is an illposed problem and the standard way to address it is through regularization,by adding a stabilizing functional to the data objective functional in order to obtain a stable solution.The traditional stabilizing functionals,in which a low-order differential operator is used,yield a smooth solution that may not be appropriate when anomalies occur in block patterns.In some cases the focused imaging of a sharp electrical boundary is necessary.Even though various experiments have used stabilizing functionals that are suitable to obtain a clear and sharp boundary,such as the minimum support(MS)and the minimum gradient support(MGS)functionals,there are still some limitations in practice.In this paper,the minimum support gradient(MSG)is proposed as the stabilizing functional.Under the uniform regularization framework,a regularized inversion with a variety of stabilizing functionals is performed and the inversion results are compared.This study shows that MSG inversion can not only obtain a clearly focused inversion but also a quite stable and robust one.
文摘As the big data era is coming, it brings new challenges to the massive data processing. A combination of GPU and CPU on chip is the trend to release the pressure of large scale computing. We found that there are different memory access characteristics between GPU and CPU. The most important one is that the programs of GPU include a large number of threads, which lead to higher access frequency in cache than the CPU programs. Although the LRU policy favors the programs with high memory access frequency, the programs of GPU can’t get the corresponding performance boost even more cache resources are provided. So LRU policy is not suitable for heterogeneous multi-core processor. Based on the different characteristics of GPU and CPU programs on memory access, this paper proposes an LLC dynamic replacement policy--DIPP (Dynamic Insertion / Promotion Policy) for heterogeneous multi-core processors.The core idea of the replacement policy is to reduce the miss rate of the program and enhance the overall system performance by limiting the cache resources that GPU can acquire and reducing the thread interferences between programs. Experiments compare the DIPP replacement policy with LRU and we conduct a classified discussion according to the program results of GPU. Friendly programs enhance 23.29% on the average performance (using arithmetic mean).Large working sets programs can improve 13.95%, compute-intensive programs enhance 9.66% and stream class programs improve 3.8%.
基金supported by the National Key Research and Development Program of China(2021ZD40303)the National Natural Science Foundation of China(62225205 and 92055213)+1 种基金Natural Science Foundation of Hunan Province of China(2021JJ10023)Shenzhen Basic Research Project(Natural Science Foundation)(JCYJ20210324140002006)。
基金Acknowledgements This work was partially supported by the Na- tional High-tech R&D Program of China (863 Program) (2012AA01A301), and the National Natural Science Foundation of China (Grant No. 61120106005). The MilkyWay-2 project is a great team effort and benefits from the cooperation of many individuals at NUDT. We thank all the people who have contributed to the system in a variety of ways.
文摘On June 17, 2013, MilkyWay-2 (Tianhe-2) supercomputer was crowned as the fastest supercomputer in the world on the 41th TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design of hardware and software systems. The key architecture features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes integrating commodity- off-the-shelf processors and accelerators that share similar instruction set architecture, powerful networks that employ proprietary interconnection chips to support the massively parallel message-passing communications, proprietary 16- core processor designed for scientific computing, efficient software stacks that provide high performance file system, emerging programming model for heterogeneous systems, and intelligent system administration. We perform extensive evaluation with wide-ranging applications from LINPACK and Graph500 benchmarks to massively parallel software deployed in the system.
基金Project supported by the National Key R&D Program of China(No.2016YFB1000101)the National Natural Science Foundation of China(Nos.61379052 and 61502513)+1 种基金the Natural Science Foundation for Distinguished Young Scholars of Hunan Province,China(No.14JJ1026)the Specialized Research Fund for the Doctoral Program of Higher Education,China(No.20124307110015)
文摘Recently, sequence anomaly detection has been widely used in many fields. Sequence data in these fields are usually multi-dimensional over the data stream. It is a challenge to design an anomaly detection method for a multi-dimensional sequence over the data stream to satisfy the requirements of accuracy and high speed. It is because:(1) Redundant dimensions in sequence data and large state space lead to a poor ability for sequence modeling;(2) Anomaly detection cannot adapt to the high-speed nature of the data stream, especially when concept drift occurs, and it will reduce the detection rate. On one hand, most existing methods of sequence anomaly detection focus on the single-dimension sequence. On the other hand, some studies concerning multi-dimensional sequence concentrate mainly on the static database rather than the data stream. To improve the performance of anomaly detection for a multi-dimensional sequence over the data stream, we propose a novel unsupervised fast and accurate anomaly detection(FAAD) method which includes three algorithms. First, a method called "information calculation and minimum spanning tree cluster" is adopted to reduce redundant dimensions. Second, to speed up model construction and ensure the detection rate for the sequence over the data stream, we propose a method called"random sampling and subsequence partitioning based on the index probabilistic suffix tree." Last, the method called "anomaly buffer based on model dynamic adjustment" dramatically reduces the effects of concept drift in the data stream. FAAD is implemented on the streaming platform Storm to detect multi-dimensional log audit data.Compared with the existing anomaly detection methods, FAAD has a good performance in detection rate and speed without being affected by concept drift.
文摘Performance and energy consumption of high performance computing (HPC) interconnection networks have a great significance in the whole supercomputer, and building up HPC interconnection network simulation plat- form is very important for the research on HPC software and hardware technologies. To effectively evaluate the per- formance and energy consumption of HPC interconnection networks, this article designs and implements a detailed and clock-driven HPC interconnection network simulation plat- form, called HPC-NetSim. HPC-NetSim uses application- driven workloads and inherits the characteristics of the de- tailed and flexible cycle-accurate network simulator. Besides, it offers a large set of configurable network parameters in terms of topology and routing, and supports router's on/off states. We compare the simulated execution time with the real execution time of Tianhe-2 subsystem and the mean error is only 2.7%. In addition, we simulate the network behaviors with different network structures and low-power modes. The results are also consistent with the theoretical analyses.
基金The work was supported by the National Natural Science Foundation of China under Grant Nos. 61272142, 61103082, 61402492, 61170261, 61103193, the National High Technology Research and Development 863 Program of China under Grant Nos. 2012AA01A301, 2012AA010901, and the Program for New Century Excellent Talents in University of China.
文摘Determinism is very useful to multithreaded programs in debugging, testing, etc. Many deterministic ap- proaches have been proposed, such as deterministic multithreading (DMT) and deterministic replay. However, these sys- tems either are inefficient or target a single purpose, which is not flexible. In this paper, we propose an efficient and flexible deterministic framework for multithreaded programs. Our framework implements determinism in two steps: relaxed determinism and strong determinism. Relaxed determinism solves data races eificiently by using a proper weak memory consistency model. After that, we implement strong determinism by solving lock contentions deterministically. Since we can apply different approaches for these two steps independently, our framework provides a spectrum of deterministic choices, including nondeterministic system (fast), weak deterministic system (fast and conditionally deterministic), DMT system, and deternfinistic replay system. Our evaluation shows that the DMT configuration of this framework could even outperform a state-of-the-art DMT system.
基金This work was partially supported by the National High Technology Research and Development 863 Program of China under Grant No. 2012AA01A301 and the National Natural Science Foundation of China under Grant No. 61120106005. Acknowledgements The Tianhe-2 project is a great team effort and benefits from the cooperation of many individuals at NUDT. We would like to thank the entire Tianhe-2 development, applications, and bench- marking teams, and all the people who have contributed to the system in a variety of ways.
文摘In this paper, we present the Tianhe-2 interconnect network and message passing services. We describe the architecture of the router and network interface chips, and highlight a set of hardware and software features effectively supporting high performance communications, ranging over remote direct memory access, collective optimization, hardwareenable reliable end-to-end communication, user-level message passing services, etc. Measured hardware performance results are also presented.
基金Acknowledgements This work was partially supported by the National High-tech R&D Program of China (863 Program) (2012AA01A301, 2013AA014301, 2013AA01A208), and by the National Basic Research Program of China (973 Program) (2011CB309705), and by the National Natural Science Foundation of China (Grant Nos. 61120106005, 61303063 and 61272482).
文摘Interconnection network plays an important role in scalable high performance computer (HPC) systems. The TH Express-2 interconnect has been used in MilkyWay-2 system to provide high-bandwidth and low-latency interprocessot communications, and continuous efforts are devoted to the development of our proprietary interconnect. This paper describes the state-of-the-art of our proprietary interconnect, especially emphasizing on the design of network interface. Several key features are introduced, such as user-level communication, remote direct memory access, offload collective operation, and hardware reliable end-to-end communication, etc. The design of a low level message passing infrastructures and an upper message passing services are also proposed. The preliminary performance results demonstrate the efficiency of the TH interconnect interface.
基金Project supported by the National Key Research and Development Program of China(No.2017YFB0202104)。
文摘With the rapid increase of the size of applications and the complexity of the supercomputer architecture,topology-aware process mapping becomes increasingly important.High communication cost has become a dominant constraint of the performance of applications running on the supercomputer.To avoid a bad mapping strategy which can lead to terrible communication performance,we propose an optimized heuristic topology-aware mapping algorithm(OHTMA).The algorithm attempts to minimize the hop-byte metric that we use to measure the mapping results.OHTMA incorporates a new greedy heuristic method and pair-exchange-based optimization.It reduces the number of long-distance communications and effectively enhances the locality of the communication.Experimental results on the Tianhe-3 exascale supercomputer prototype indicate that OHTMA can significantly reduce the communication costs.
基金Acknowledgements This work was partially supported by National High-tech R&D Program of China (863 Program) (2012AA01A301, 2012AA010901), by Program for New Century Excellent Talents in University and by National Natural Science Foundation of China (Grant Nos. 61272142, 61103082, 61170261, and 61103193).
文摘With the increase of system scale, the inherent reliability of supercomputers becomes lower and lower. The cost of fault handling and task recovery increases so rapidly that the reliability issue will soon harm the usability of supercomputers. This issue is referred to as the "reliability wall", which is regarded as a critical problem for current and future supercomputers. To address this problem, we propose an autonomous fault-tolerant system, named Iaso, in MilkyWay- 2 system. Iaso introduces the concept of autonomous management in supercomputers. By autonomous management, the computer itself, rather than manpower, takes charge of the fault management work. Iaso automatically manage the whole lifecycle of faults, including fault detection, fault diagnosis, fault isolation, and task recovery. Iaso endows the autonomous features with MilkyWay-2 system, such as self-awareness, self-diagnosis, self-healing, and self-protection. With the help of Iaso, the cost of fault handling in supercomputers reduces from several hours to a few seconds. Iaso greatly improves the usability and reliability of MilkyWay-2 system.
基金supported by the Research Plan Project of National University of Defense Technology under Grant No.JC13-06-01the OCRit Project made possible by the Global Leadership Round in Genomics&Life Sciences Grant(GL2)
文摘Obtaining training material for rarely used English words and common given names from countries where English is not spoken is difficult due to excessive time, storage and cost factors. By considering personal privacy, language- independent (LI) with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a convenient option to solve tile problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR for real-time applications with limited storage and small vocabularies. These applications include voice dialing on mobile devices, menu-driven recognition, and voice control on vehicles and robotics. However, traditional DTW has several lhnitations, such as high computational complexity, constraint induced coarse approximation, and inaccuracy problems. In this paper, we introduce the merge-weighted dynamic time warping (MWDTW) algorithm. This method defines a template confidence index for measuring the similarity between merged training data and testing data, while following the core DTW process. MWDTW is simple, efficient, and easy to implement. With extensive experiments on three representative SD speech recognition datasets, we demonstrate that our method outperforms DTW, DTW on merged speech data, the hidden Markov model (HMM) significantly, and is also six times faster than DTW overall.
基金supported by the National Numerical Windtunnel project(NNW2019ZT5-A10)the National Key Research and Development Program of China(2018YFB0204301,2017YFB0202104).
文摘Deep neural networks(DNNs)have recently shown great potential in solving partial differential equations(PDEs).The success of neural network-based surrogate models is attributed to their ability to learn a rich set of solution-related features.However,learning DNNs usually involves tedious training iterations to converge and requires a very large number of training data,which hinders the application of these models to complex physical contexts.To address this problem,we propose to apply the transfer learning approach to DNN-based PDE solving tasks.In our work,we create pairs of transfer experiments on Helmholtz and Navier-Stokes equations by constructing subtasks with different source terms and Reynolds numbers.We also conduct a series of experiments to investigate the degree of generality of the features between different equations.Our results demonstrate that despite differences in underlying PDE systems,the transfer methodology can lead to a significant improvement in the accuracy of the predicted solutions and achieve a maximum performance boost of 97.3%on widely used surrogate models.