Abstract: The federated self-supervised framework is a distributed machine learning method that combines federated learning with self-supervised learning, which addresses the difficulty traditional federated learning has in processing large-scale unlabeled data. Existing federated self-supervised frameworks suffer from low communication efficiency and high communication delay between clients and the central server. We therefore add edge servers to the framework to reduce the pressure that frequent client-server communication places on the central server. We propose a communication compression scheme using gradient quantization and sparsification to optimize communication across the entire framework, and we improve the algorithm of the sparse communication compression module. Experiments show that the learning rate changes of the improved sparse communication compression module are smoother and more stable, and that the communication compression scheme effectively reduces the overall communication overhead.
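The abstract names gradient quantization and sparsification but gives no algorithmic detail, so the following is only a minimal sketch of the general idea, not the authors' module. It assumes top-k magnitude sparsification followed by uniform symmetric 8-bit quantization; the function names `topk_sparsify`, `quantize`, and `decompress` are hypothetical.

```python
from typing import List, Tuple


def topk_sparsify(grad: List[float], k: int) -> Tuple[List[int], List[float]]:
    """Keep the k entries of largest magnitude; return (indices, values)."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    idx = sorted(order[:k])
    return idx, [grad[i] for i in idx]


def quantize(values: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Uniform symmetric quantization (an assumption, not the paper's scheme).

    Returns integer codes plus the single shared scale needed to decode them.
    """
    levels = 2 ** (bits - 1) - 1
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def decompress(idx: List[int], codes: List[int], scale: float, size: int) -> List[float]:
    """Rebuild a dense gradient; entries that were not sent stay at zero."""
    dense = [0.0] * size
    for i, c in zip(idx, codes):
        dense[i] = c * scale
    return dense


if __name__ == "__main__":
    gradient = [0.1, -2.0, 0.05, 1.0, -0.3]
    indices, values = topk_sparsify(gradient, k=2)
    codes, scale = quantize(values)
    # Only the indices, the small integer codes, and one scale are transmitted.
    print(indices, codes, decompress(indices, codes, scale, len(gradient)))
```

In this sketch a client would send `k` indices, `k` small integers, and one float instead of the full gradient, which is where the communication saving would come from. Practical schemes often also accumulate the dropped residual locally, which is omitted here.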
Funding: Project supported by the National Natural Science Foundation of China (Grant No. 11002086) and the Shanghai Leading Academic Discipline Project (Grant No. J50103).
Abstract: Efficient communication is important to every parallel algorithm. A parallel communication optimization is introduced into the lattice Boltzmann method (LBM). It relies on a simplified communication strategy implemented with the least squares method. Tests on a parallel platform show that, compared with the standard parallel lattice Boltzmann algorithm, the improved algorithm provides better stability and higher performance while maintaining the same accuracy.
Funding: Funded by the State Grid Corporation of China project "Cooperative Simulation of Power Grid and Communication Grid", the National Natural Science Funds (51407030), and the China Postdoctoral Science Foundation (121809).
Abstract: With the large-scale development of distributed generation (DG) and its potential role in microgrids, the microgrid cluster (MGC) has become a useful control model to assist the integration of DG. Considering the microgrids in an MGC, power dispatch optimization in an MGC is difficult to achieve. In this paper, a hybrid interactive communication optimization solution (HICOS) based on flexible communication is proposed, which can handle the plug-in and plug-out operation states of microgrids in MGC power dispatch optimization. HICOS has a hierarchical architecture: the upper layer uses distributed control among multiple microgrids, with no central controller for the MGC, and the lower layer uses a central controller for each microgrid. Over flexible communication links, the microgrids exchange the iterative optimization information, so that HICOS gradually converges to the global optimal solution. When some microgrids plug in or plug out, the communication links change, which can prevent the system from reaching the optimal solution. Unlike the fixed communication links of traditional communication networks, HICOS redefines the topology of the flexible communication links so that the global optimal solution can still be reached. Simulation studies show that HICOS can effectively reach the global optimal dispatch solution without an MGC central controller. In particular, when microgrids plug in or plug out, HICOS still reaches the global optimal solution based on the redefined communication link topology.
Funding: Supported by the National Key Research and Development Program of China under Grant No. 2018YFB1003502 and the National Natural Science Foundation of China under Grant Nos. 62072195, 61825202, 61832006, and 61628204.
Abstract: With the rapid growth of real-world graphs, whose size can easily exceed the on-chip (board) storage capacity of an accelerator, processing large-scale graphs on a single Field Programmable Gate Array (FPGA) becomes difficult, and multi-FPGA acceleration becomes necessary. Many cloud providers (e.g., Amazon, Microsoft, and Baidu) now expose FPGAs to users in their data centers, providing opportunities to accelerate large-scale graph processing. In this paper, we present a communication library called FDGLib, which can easily scale out any existing single-FPGA graph accelerator to a distributed version in a data center with minimal hardware engineering effort. FDGLib provides six APIs that can be integrated into any FPGA-based graph accelerator with only a few lines of code modification. Considering the torus-based FPGA interconnection in data centers, FDGLib also improves communication efficiency using simple yet effective torus-friendly graph partition and placement schemes. We interface FDGLib into AccuGraph, a state-of-the-art graph accelerator. Our results on a 32-node Microsoft Catapult-like data center show that the distributed AccuGraph can be 2.32x and 4.77x faster than ForeGraph, a state-of-the-art distributed FPGA-based graph accelerator, and Gemini, a distributed CPU-based graph system, respectively, with better scalability.
Abstract: Complicated global climate problems lead researchers from different scientific disciplines to link multiphysics simulations, called models, for integrated modeling of climate change by using a software framework called earth system modeling (ESM). As its critical component, the coupler is in charge of connections and interactions among models. With the advance of next-generation models, greater data transfer volume and higher coupling frequency are expected to put a heavy performance burden on the coupler, so highly efficient coupling techniques are required. In this paper, we propose the sub-domain mapping method to improve parallel coupling, which consists of data transfer and data transformation. By using one specific interpolation-oriented communication routing, the communication operations that are originally scattered across various steps can be combined for execution. This reduces redundant communication and the synchronization costs it entails. Tests on the Tianhe-1A (TH-1A) supercomputer show that our method achieves 1.1 to 4.9 fold performance improvements. We also present a further optimization for multi-interpolation cases. The test results show that our method achieves up to 3.4 fold speedup over the original coupling execution of the current climate system.
Abstract: Shared Memory Processors (SMP) workstation clusters are becoming more and more popular. To optimize communication between the workstations, a new graph partition problem is formulated for scheduling tasks in SMP clusters. The problem is NP-complete, and a heuristic algorithm is developed based on Lee, Kim and Park's algorithm. Experimental results indicate that our algorithm outperforms theirs, especially when the number of partitions is large. The algorithm can be integrated into a parallelizing compiler as a back-end optimizer for the distributed code generator.
Funding: This work was supported by the National Natural Science Foundation of China (Grant No. 69933020) and the "973" Program (Grant No. G1999032702).
Abstract: This paper discusses compile-time task scheduling of parallel programs running on clusters of SMP workstations. First, the problem is stated formally, transformed into a graph partition problem, and proved to be NP-complete. A heuristic algorithm, MMP-Solver, is then proposed to solve the problem. Experimental results show that the task scheduling can greatly reduce the communication overhead of parallel applications and that MMP-Solver outperforms the existing algorithms.
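To illustrate what casting task scheduling as graph partitioning can look like, the sketch below scores a task-to-workstation assignment by the total weight of edges that cross workstations and then applies a simple greedy single-task refinement. This is not MMP-Solver and not Lee, Kim and Park's algorithm, neither of which is described in the abstract; the names `cut_weight` and `greedy_refine` and the per-workstation `capacity` limit are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Tuple

Edges = Dict[Tuple[Hashable, Hashable], float]


def cut_weight(edges: Edges, part: Dict[Hashable, int]) -> float:
    """Total weight of task-graph edges whose endpoints sit on different workstations."""
    return sum(w for (u, v), w in edges.items() if part[u] != part[v])


def greedy_refine(edges: Edges, part: Dict[Hashable, int], capacity: int) -> Dict[Hashable, int]:
    """Move one task at a time while a move lowers the cut and fits the capacity.

    Every accepted move strictly lowers the cut weight, so the loop terminates.
    Only workstations that appear in the initial assignment are considered.
    """
    part = dict(part)
    load = Counter(part.values())
    improved = True
    while improved:
        improved = False
        for task in sorted(part):
            current = part[task]
            best, best_cost = current, cut_weight(edges, part)
            for node in sorted(load):
                if node == current or load[node] >= capacity:
                    continue
                part[task] = node  # tentative move, evaluated then possibly undone
                cost = cut_weight(edges, part)
                if cost < best_cost:
                    best, best_cost = node, cost
            part[task] = best
            if best != current:
                load[current] -= 1
                load[best] += 1
                improved = True
    return part


if __name__ == "__main__":
    edges = {("a", "b"): 10, ("b", "c"): 1, ("c", "d"): 10}
    start = {"a": 0, "b": 1, "c": 0, "d": 1}
    result = greedy_refine(edges, start, capacity=3)
    print(cut_weight(edges, start), cut_weight(edges, result))
```

A greedy pass like this can stall in a local minimum, which is one reason an NP-complete partition problem is usually attacked with more elaborate heuristics.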
Funding: Supported by the National Key R&D Program of China (Nos. 2017YFB0202003 and 2017YFB0202104).
Abstract: Kinetic Monte Carlo (KMC) is a widely used method for studying the evolution of materials at the microscopic level. While there are many simulation programs based on this algorithm, most focus on verifying a particular phenomenon and have no simulation-scale requirement, so many are serial in nature. The dynamic Monte Carlo algorithm is implemented in a parallel framework called SPPARKS, but it does not support the Embedded Atom Method (EAM) potential, which is commonly used in the dynamic simulation of metal materials. Metal, the preferred material for most containers and components, plays an important role in many fields, including construction engineering and transportation. In this paper, we propose and describe the development of a parallel software program called Crystal-KMC, which is specifically used to simulate the lattice dynamics of metallic materials. The software uses MPI to achieve a parallel multiprocessing mode, which avoids the simulation-scale limitations of serial software. We then describe the use of the parallel KMC simulation software Crystal-KMC in simulating the diffusion of vacancies in iron and analyze the experimental results. In addition, we tested the performance of Crystal-KMC on "meta-Era" supercomputing clusters, and the results show that Crystal-KMC has good parallel speedup and scalability.
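For readers unfamiliar with KMC, a single step of the standard rejection-free (residence-time) algorithm can be sketched as follows. This is a serial toy under stated assumptions, not Crystal-KMC: it uses constant hop rates instead of EAM-derived energies, has no MPI parallelism, and moves one vacancy on a hypothetical one-dimensional periodic lattice. The names `kmc_step` and `simulate_vacancy` are illustrative only.

```python
import math
import random
from typing import Sequence, Tuple


def kmc_step(rates: Sequence[float], rng: random.Random) -> Tuple[int, float]:
    """Pick one event with probability proportional to its rate.

    Returns (event index, time increment); the increment is drawn from an
    exponential distribution whose rate is the sum of all event rates.
    """
    total = sum(rates)
    if total <= 0:
        raise ValueError("at least one event must have a positive rate")
    target = rng.random() * total
    cumulative = 0.0
    chosen = len(rates) - 1  # fallback guards against floating-point round-off
    for i, rate in enumerate(rates):
        cumulative += rate
        if target < cumulative:
            chosen = i
            break
    dt = -math.log(1.0 - rng.random()) / total  # 1 - u lies in (0, 1], so log is defined
    return chosen, dt


def simulate_vacancy(n_sites: int, steps: int, hop_rate: float, seed: int = 0) -> Tuple[int, float]:
    """Hop one vacancy left or right on a 1D ring (toy model, equal constant rates)."""
    rng = random.Random(seed)
    position, elapsed = 0, 0.0
    for _ in range(steps):
        event, dt = kmc_step([hop_rate, hop_rate], rng)  # event 0: left, event 1: right
        position = (position + (1 if event == 1 else -1)) % n_sites
        elapsed += dt
    return position, elapsed


if __name__ == "__main__":
    print(simulate_vacancy(n_sites=10, steps=100, hop_rate=2.0))
```

In a realistic simulation the rate list would be rebuilt after every event from the local atomic environment, which is where an interatomic potential such as EAM would enter.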