The possibility of carrying out a purely heterogeneous Heck reaction in practice without Pd leaching has been considered by a number of research groups, but no consensus has yet been reached. Here, the reaction was, for the first time, evaluated by a simple computational approach. Modelling experiments were performed on one of the initial catalytic steps: attachment of phenyl halides at the Pd (111)-to-(100) and (111)-to-(111) ridges of a Pd crystal. Three resulting surface structures were identified as possible reactive intermediates. The relative stabilities of these surface species were then determined by potential energy minimisation calculations based on a universal force field. The results showed the most stable species to be one in which a Pd ridge atom is removed from the Pd crystal structure, suggesting that Pd leaching induced by phenyl halides is energetically favourable.
The problem of joint radio and cloud resource allocation is studied for heterogeneous mobile cloud computing networks. The objective of the proposed joint resource allocation schemes is to maximize the total utility of users while satisfying the required quality of service (QoS), such as the end-to-end response latency experienced by each user. We formulate joint resource allocation as a combinatorial optimization problem. Three evolutionary approaches are considered to solve the problem: genetic algorithm (GA), ant colony optimization with genetic algorithm (ACO-GA), and quantum genetic algorithm (QGA). To decrease the time complexity, we propose a mapping process between the resource allocation matrix and the chromosome of GA, ACO-GA, and QGA, search the available radio and cloud resource pairs based on the resource availability matrices for ACO-GA, and encode the difference between the allocated resources and the minimum resource requirement for QGA. Extensive simulation results show that the proposed methods greatly outperform the existing algorithms in terms of running time, accuracy of the final results, total utility, resource utilization, and end-to-end response latency guarantees.
The particle-in-cell (PIC) method has benefited greatly from GPU-accelerated heterogeneous systems. However, the performance of PIC is constrained by the interpolation operations in the weighting process on the GPU (graphics processing unit). To address this problem, a fast weighting method for PIC simulation on GPU-accelerated systems was proposed that avoids atomic memory operations during the weighting process. The method was implemented by taking advantage of the GPU's thread synchronization mechanism and dividing the problem space properly. Moreover, software-managed shared memory on the GPU was employed to buffer the intermediate data. The experimental results show that the method achieves speedups of up to 3.5 times compared to previous works, and runs 20.08 times faster on one NVIDIA Tesla M2090 GPU than on a single core of an Intel Xeon X5670 CPU.
Monte Carlo (MC) simulation is regarded as the gold standard for dose calculation in brachytherapy, but it consumes a large amount of computing resources. The development of heterogeneous computing makes it possible to substantially accelerate calculations with hardware accelerators. Accordingly, this study develops a fast MC tool, called THUBrachy, which can be accelerated by several types of hardware accelerators. THUBrachy can simulate photons with energy less than 3 MeV and considers all photon interactions in the energy range. It was benchmarked against the American Association of Physicists in Medicine Task Group No. 43 Report using a water phantom and validated with Geant4 using a clinical case. A performance test was conducted using the clinical case, showing that a multicore central processing unit, Intel Xeon Phi, and graphics processing unit (GPU) can efficiently accelerate the simulation. GPU-accelerated THUBrachy is the fastest version, which is 200 times faster than the serial version and approximately 500 times faster than Geant4. The proposed tool shows great potential for fast and accurate dose calculations in clinical applications.
In recent years, with the development of processor architecture, heterogeneous processors that include a central processing unit (CPU) and a graphics processing unit (GPU) have become mainstream. However, because of the differences between heterogeneous cores, heterogeneous systems face many problems that need to be solved. To address these problems, this paper focuses on the utilization and efficiency of heterogeneous cores and designs reasonable resource scheduling strategies. To improve the performance of the system, this paper proposes a combination strategy for a single task and a multi-task scheduling strategy for multiple tasks. The combination strategy consists of two sub-strategies. The first improves the execution efficiency of tasks on the GPU by changing the thread organization structure. The second focuses on the working state of the efficient core and develops more reasonable workload balancing schemes to improve the resource utilization of heterogeneous systems. The multi-task scheduling strategy obtains the execution efficiency of heterogeneous cores and global task information by processing task samples. Based on this information, an improved ant colony algorithm is used to quickly obtain a reasonable task allocation scheme that fully utilizes the characteristics of heterogeneous cores. The experimental results show that the combination strategy reduces task execution time by 29.13% on average. When processing multiple tasks, the multi-task scheduling strategy reduces the execution time by up to 23.38% relative to the combination strategy. Both strategies make better use of the resources of heterogeneous systems and significantly reduce task execution time on heterogeneous systems.
Federated learning is an emerging machine learning technique that enables clients to collaboratively train a deep learning model without uploading raw data to the aggregation server. Each client may be equipped with different computing resources for model training. A client equipped with lower computing capability requires more time for model training, resulting in a prolonged training time in federated learning. Moreover, it may fail to train the entire model because of out-of-memory issues. This study aims to tackle these problems and proposes the federated feature concatenate (FedFC) method for federated learning with heterogeneous clients. FedFC leverages model splitting and feature concatenation to offload a portion of the training load from clients to the aggregation server. Each client in FedFC can collaboratively train a model with different cutting layers. Therefore, the specific features learned in the deeper layers of the server-side model are more identical for data class classification. Accordingly, FedFC can reduce the computational load for resource-constrained clients and shorten the convergence time. The effectiveness of the method is verified by considering different dataset scenarios in the experiments, such as data and class imbalance among the participating clients. The performance impacts of different cutting layers are evaluated during model training. The experimental results show that the co-adapted features have a critical impact on the adequate classification of the deep learning model. Overall, FedFC not only shortens the convergence time, but also improves the best accuracy by up to 5.9% and 14.5% when compared to conventional federated learning and splitfed, respectively. In conclusion, the proposed approach is feasible and effective for heterogeneous clients in federated learning.
A heterogeneous computing (HC) environment utilizes diverse resources with different computational capabilities to solve computing-intensive applications that have diverse computational requirements and constraints. The task assignment problem in an HC environment can be formally defined as follows: given a set of tasks and a set of machines, assign tasks to machines so as to achieve the minimum makespan. In this paper we propose a new task scheduling heuristic, high standard deviation first (HSTDF), which uses the standard deviation of the expected execution time of a task as a selection criterion. The standard deviation of the expected execution time of a task represents the amount of variation in the task's execution time across different machines. Our conclusion is that tasks with a high standard deviation must be scheduled first. A large number of experiments were carried out to check the effectiveness of the proposed heuristic in different scenarios, and the comparison with the existing heuristics (Max-min, Sufferage, Segmented Min-average, Segmented Min-min, and Segmented Max-min) clearly reveals that the proposed heuristic outperforms all of them in terms of average makespan.
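The high-standard-deviation-first idea can be illustrated with a minimal Python sketch. The abstract only states that tasks with a high standard deviation of expected execution time are scheduled first; the machine-selection rule used here (earliest completion time), the function name `hstdf_schedule`, and the sample matrix are assumptions made for illustration and are not taken from the paper.

```python
import statistics


def hstdf_schedule(etc):
    """Schedule tasks in order of decreasing execution-time spread.

    etc[t][m] is the expected execution time of task t on machine m.
    Returns (assignment, makespan), where assignment maps task -> machine.
    """
    n_machines = len(etc[0])
    ready = [0.0] * n_machines  # time at which each machine becomes free

    # Tasks whose execution time varies most across machines go first.
    order = sorted(
        range(len(etc)),
        key=lambda t: statistics.pstdev(etc[t]),
        reverse=True,
    )

    assignment = {}
    for t in order:
        # Assumed rule (not from the abstract): earliest completion time.
        m = min(range(n_machines), key=lambda j: ready[j] + etc[t][j])
        ready[m] += etc[t][m]
        assignment[t] = m
    return assignment, max(ready)


# Hypothetical 3-task, 2-machine expected-time-to-compute matrix.
etc = [
    [4.0, 5.0],
    [2.0, 20.0],
    [6.0, 6.5],
]
assignment, makespan = hstdf_schedule(etc)
```

In this toy matrix, task 1 has by far the largest spread (2 versus 20 time units), so it is placed first and secures the machine on which it is cheap. That is the intuition behind prioritising high-variation tasks.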
Task scheduling determines the performance of network of workstations (NOW) computing to a large extent. However, the computer system architecture, computing capability, and system load are rarely considered together. In this paper, a biggest heterogeneous scheduling algorithm is presented. It fully considers the system characteristics (from the application view), structure, and state, so it can always utilize all processing resources under a reasonable premise. The experimental results show that the algorithm can significantly shorten the response time of jobs.
Molecular dynamics (MD) simulation for computing the interatomic potential (IAP) is a very important high-performance computing (HPC) application. MD simulation of particles of experimental relevance takes a huge amount of computation time, even on an expensive high-end server. Heterogeneous computing, a combination of a field-programmable gate array (FPGA) and a computer, is proposed as a solution for computing MD simulations efficiently. In such heterogeneous computation, communication between the FPGA and the computer is necessary. One such MD simulation, explained in the paper, is the artificial neural network (ANN)-based IAP computation of gold (Au147 and Au309) nanoparticles. MD simulation calculates the forces between atoms and the total energy of the chemical system. This work proposes a novel design and implementation of an ANN IAP-based MD simulation for Au147 and Au309 that uses communication protocols, namely Universal Asynchronous Receiver-Transmitter (UART) and Ethernet, for communication between the FPGA and the host computer. To improve the latency of MD simulation through heterogeneous computing, the UART and Ethernet protocols were explored in an MD simulation of 50,000 cycles. In this study, computation times of 17.54 and 18.70 h were achieved with UART and Ethernet, respectively, compared to the conventional server time of 29 h for Au147 nanoparticles. The results pave the way for the development of a lab-on-a-chip application.
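For orientation, the reported Au147 timings can be converted into speedup and relative time savings. This is a worked calculation from the three figures quoted in the abstract (29 h, 17.54 h, 18.70 h); the symbols $S$ and $R$ are introduced here for illustration only.

```latex
\begin{align*}
S_{\text{UART}} &= \frac{29}{17.54} \approx 1.65, &
R_{\text{UART}} &= \frac{29 - 17.54}{29} \approx 39.5\% \\
S_{\text{Ethernet}} &= \frac{29}{18.70} \approx 1.55, &
R_{\text{Ethernet}} &= \frac{29 - 18.70}{29} \approx 35.5\%
\end{align*}
```

Here $S$ is the speedup relative to the conventional server and $R$ is the fractional reduction in computation time.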
Graphics processing units (GPUs) are used to accelerate computing-intensive tasks such as neural networks, data analysis, and high-performance computing. Over the past decade or so, researchers have done a great deal of work on GPU architecture and proposed a variety of theories and methods to study the microarchitectural characteristics of various GPUs. In this study, the GPU serves as a co-processor and works together with the CPU in an embedded real-time system to handle computationally intensive tasks. The study models the architecture of the GPU, building on existing work, and takes the SIMT mechanism and cache-miss behavior into account to provide a more detailed analysis of the GPU architecture. To verify the proposed GPU architecture model, experiments were performed with 10 GPU kernel tasks on an Nvidia GPU device. The experimental results showed that the minimum error between the kernel task execution time predicted by the proposed GPU architecture model and the measured kernel task execution time was 3.80%, and the maximum error was 8.30%.
Most natural resources are processed as particle-fluid multiphase systems in the chemical, mineral, and material industries; therefore, discrete particle methods (DPM) are a reasonable choice of simulation method for engineering the relevant processes and equipment. However, direct application of these methods is challenged by the complex multiscale behavior of such systems, which leads to enormous computational cost or otherwise to a qualitatively inaccurate description of the mesoscale structures. The coarse-grained DPM based on the energy-minimization multi-scale (EMMS) model, or EMMS-DPM, was proposed to reduce the computational cost by several orders of magnitude while maintaining an accurate description of the mesoscale structures, which paves the way for its engineering applications. Further empowered by the high-efficiency multi-scale DEM software DEMms and the corresponding customized heterogeneous supercomputing facilities with graphics processing units (GPUs), it may even approach real-time simulation of industrial reactors. This short review introduces the principle of DPM, in particular EMMS-DPM, and the recent developments in the modeling, numerical implementation, and application of large-scale DPM, which aims to reach industrial scale on the one hand and to resolve the mesoscale structures critical to reaction-transport coupling on the other. The review concludes with an outlook on the future development of DPM in this direction.
Parallel computing techniques have been introduced into digital image correlation (DIC) in recent years and have led to a surge in computation speed. Graphics processing unit (GPU)-based parallel computing has demonstrated a surprising effect in accelerating iterative subpixel DIC compared with CPU-based parallel computing. In this paper, the performance of the two kinds of parallel computing techniques is compared for the previously proposed path-independent DIC method, in which the initial guess for the inverse compositional Gauss-Newton (IC-GN) algorithm at each point of interest (POI) is estimated through the fast Fourier transform-based cross-correlation (FFT-CC) algorithm. Based on the performance evaluation, a heterogeneous parallel computing (HPC) model with a hybrid mode of parallelism is proposed to combine the computing power of the GPU and the multicore CPU. A trial computation test scheme is developed to optimize the configuration of the HPC model on a specific computer. The proposed HPC model shows excellent performance on a mid-range desktop computer for real-time subpixel DIC at a high resolution of more than 10000 POIs per frame.
As the hardware industry moves toward using specialized heterogeneous many-core processors to avoid the effects of the power wall, software developers are finding it hard to deal with the complexity of these systems. In this paper, we share our experience of developing a programming model and its supporting compiler and libraries for Matrix-3000, which is designed for next-generation exascale supercomputers but has a complex memory hierarchy and processor organization. To assist its software development, we have developed a software stack from scratch that includes a low-level programming interface and a high-level OpenCL compiler. Our low-level programming model offers native programming support for using the bare-metal accelerators of Matrix-3000, while the high-level model allows programmers to use the OpenCL programming standard. We detail our design choices and highlight the lessons learned from developing system software to enable the programming of bare-metal accelerators. Our programming models have been deployed in the production environment of an exascale prototype system.
To reduce the running time of network simulation in a heterogeneous computing environment, a network simulation task partition method, named LBPHCE, is put forward. In this method, the network simulation task is partitioned with comprehensive consideration of the load balance of both routing computing simulation and packet forwarding simulation. First, through benchmark experiments, the computation ability and routing simulation ability of each simulation machine in the heterogeneous computing environment are measured. Second, based on the computation ability of each simulation machine, the network simulation task is initially partitioned to balance the load of packet forwarding simulation; then, according to the routing computation ability, the scale of each partition is fine-tuned to balance the routing computing simulation while the load balance of packet forwarding simulation is preserved. Experiments based on PDNS indicate that, compared to the traditional uniform partition method, the LBPHCE method reduces the total simulation running time by 26.3% on average, and compared to the linear partition method, it reduces the running time by 18.3% on average.
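The initial partitioning step, in which partition sizes follow the measured computation ability of each machine, can be sketched in Python as below. This is an assumption-laden illustration, not the LBPHCE algorithm itself: the function name `proportional_partition`, the largest-remainder rounding, and the ability scores are hypothetical, and the routing-based fine-tuning step described in the abstract is omitted.

```python
def proportional_partition(total_nodes, abilities):
    """Split total_nodes simulated nodes across machines in proportion
    to their benchmarked computation ability (hypothetical scores)."""
    total_ability = sum(abilities)
    # Ideal (fractional) share of nodes for each machine.
    shares = [total_nodes * a / total_ability for a in abilities]
    sizes = [int(s) for s in shares]  # round every share down first

    # Hand the nodes lost to rounding to the machines with the
    # largest fractional remainders (largest-remainder method).
    leftover = total_nodes - sum(sizes)
    by_remainder = sorted(
        range(len(abilities)),
        key=lambda i: shares[i] - sizes[i],
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes


# Three machines whose benchmark scores are in the ratio 1 : 2 : 4.
sizes = proportional_partition(100, [1.0, 2.0, 4.0])
```

A uniform partition would give each machine about 33 nodes regardless of speed; the proportional split gives the fastest machine the largest share, which is the kind of imbalance correction the abstract's comparison against uniform partitioning points at.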
Heterogeneous systems with both central processing units (CPUs) and graphics processing units (GPUs) are frequently used to accelerate short-ranged molecular dynamics (MD) simulations. The most time-consuming task in short-ranged MD simulations is the computation of particle-to-particle interactions. Beyond a certain distance, these interactions decrease to zero. To minimize the number of distance checks, previous works have tiled interactions by employing the spatial attribute, which increases memory accesses and GPU computations and hence decreases performance. Other studies ignore the spatial attribute and construct an all-versus-all interaction matrix, which scales poorly. This paper presents an improved algorithm. The algorithm first bins particles into voxels according to their spatial attributes, and then tiles the all-versus-all matrix into voxel-versus-voxel sub-matrices. Only the sub-matrices between neighboring voxels are computed on the GPU. The algorithm therefore reduces the number of distance checks and limits additional memory accesses and GPU computations. This paper also adopts a multi-level programming model to implement the algorithm on multiple nodes of Tianhe-1A. By employing (1) a patch design to exploit parallelism across the simulation domain, (2) a communication overlapping method to overlap the communications between CPUs and GPUs, and (3) a dynamic workload balancing method to adjust the workloads among compute nodes, the implementation achieves a speedup of 4.16x on one NVIDIA Tesla M2050 GPU compared to a 2.93 GHz six-core Intel Xeon X5670 CPU. In addition, it runs 2.41x faster on 256 compute nodes of Tianhe-1A (with two CPUs and one GPU inside a node) than on 256 GPU-excluded nodes.
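The voxel-binning idea, in which only particle pairs in the same or neighboring voxels are distance-checked, can be sketched on the CPU in plain Python. This is a sequential illustration under stated assumptions, not the paper's GPU implementation: the voxel edge is assumed equal to the cutoff radius, there are no periodic boundaries, and the function name `neighbor_pairs` and the sample coordinates are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def neighbor_pairs(positions, cutoff):
    """Return all index pairs (i, j), i < j, closer than `cutoff`,
    checking only particles in the same or adjacent voxels."""
    # Bin particles into cubic voxels whose edge equals the cutoff,
    # so any interacting pair lies in the same or an adjacent voxel.
    voxels = defaultdict(list)
    for idx, (x, y, z) in enumerate(positions):
        key = (
            math.floor(x / cutoff),
            math.floor(y / cutoff),
            math.floor(z / cutoff),
        )
        voxels[key].append(idx)

    pairs = set()
    for (cx, cy, cz), members in voxels.items():
        # Visit the voxel itself and its 26 neighbors.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            # .get() avoids creating empty voxels while iterating.
            others = voxels.get((cx + dx, cy + dy, cz + dz), [])
            for i in members:
                for j in others:
                    if i < j and math.dist(positions[i], positions[j]) <= cutoff:
                        pairs.add((i, j))
    return pairs


positions = [
    (0.1, 0.1, 0.1),
    (0.9, 0.1, 0.1),
    (1.3, 0.1, 0.1),
    (5.0, 5.0, 5.0),
]
pairs = neighbor_pairs(positions, cutoff=1.0)
```

Each voxel-versus-neighbor-voxel loop corresponds to one sub-matrix of the all-versus-all interaction matrix; the distant particle at (5, 5, 5) is never compared with the others because its voxel has no occupied neighbors.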
Implicit coscheduling techniques applied to non-dedicated homogeneous networks of workstations (NOWs) have been shown to perform well when many local users compete with a single parallel job. Implicit coscheduling minimizes the communication waiting time of parallel processes by identifying the processes in need of coscheduling through gathering and analyzing implicit runtime information, basically communication events. Unfortunately, implicit coscheduling techniques do not guarantee the performance of local and parallel jobs when the number of parallel jobs competing against each other increases. As a result, the idle computational resources are used inefficiently. To solve these problems, a new technique, named Cooperating CoScheduling (CCS), is presented in this work. Unlike traditional implicit coscheduling techniques, under CCS each node takes its scheduling decisions from the occurrence of local events, basically communication, memory, input/output, and CPU, together with foreign events received from cooperating nodes. This allows CCS to provide a social contract based on reserving a percentage of CPU and memory resources to ensure the progress of parallel jobs without disturbing the local users, while coscheduling of communicating tasks is ensured. In addition, the CCS algorithm uses status information from the cooperating nodes to balance the resources across the cluster when necessary. Experimental results in a non-dedicated heterogeneous NOW reveal that CCS allows the idle resources to be exploited efficiently, obtaining a satisfactory speedup while causing an overhead that is imperceptible to the local user.
The geometric multigrid method (GMG) is one of the most efficient solving techniques for discrete algebraic systems arising from elliptic partial differential equations. GMG utilizes a hierarchy of grids or discretizations and reduces the error at a number of frequencies simultaneously. Graphics processing units (GPUs) have recently burst onto the scientific computing scene as a technology that has yielded substantial performance and energy-efficiency improvements. A central challenge in implementing GMG on GPUs, though, is that computational work on coarse levels cannot fully utilize the capacity of a GPU. In this work, we perform numerical studies of GMG on CPU-GPU heterogeneous computers. Furthermore, we compare our implementation with an efficient CPU implementation of GMG and with the most popular fast Poisson solver, the Fast Fourier Transform, in the cuFFT library developed by NVIDIA.
The widespread application of heterogeneous cloud computing has enabled enormous advances in the real-time performance of telehealth systems. A cloud-based telehealth system allows healthcare users to obtain medical data from various data sources supported by heterogeneous cloud providers. Employing data duplication in distributed cloud databases is an alternative approach for achieving data sharing among multiple data users. However, this approach uses additional storage space, whereas reducing data duplication degrades data acquisition and real-time performance. To address this issue, this paper develops a dynamic data deduplication method that uses an intelligent blocker to determine the working mode of data duplication for each data package in heterogeneous cloud-based telehealth systems. The proposed approach is named SD2M (Smart Data Deduplication Model); its main algorithm applies dynamic programming to produce optimal solutions that minimize the total cost of data usage. We carry out experimental evaluations to examine the adaptability of the proposed approach.
With computing systems having undergone a fundamental transformation from the single-processor devices of the turn of the century to ubiquitous networked devices and warehouse-scale computing via the cloud, parallelism has become ubiquitous at many levels. At the micro level, parallelism is being explored from the underlying circuits to pipelining and instruction-level parallelism on multi-cores or many cores on a chip as well as in a machine. At the macro level, parallelism is being promoted from multiple machines on a rack and many racks in a data center to the globally shared infrastructure of the Internet. With the push of big data, we are entering a new era of parallel computing driven by novel and groundbreaking research innovation in elastic parallelism and scalability. In this paper, we give an overview of computing infrastructure for big data processing, focusing on the architectural, storage, and networking challenges of supporting big data. We briefly discuss emerging computing infrastructure and technologies that are promising for improving data parallelism and task parallelism and for encouraging vertical and horizontal computation parallelism.
The management of resource contention in shared clouds remains an open problem. The evolution and deployment of new application paradigms (e.g., deep learning training and microservices) and custom hardware (e.g., graphics processing units (GPUs) and tensor processing units (TPUs)) have posed new challenges in resource management system design. Current solutions tend to trade cluster efficiency for guaranteed application performance, e.g., through resource over-allocation, leaving a lot of resources underutilized. Overcoming this dilemma is not easy, because different components across the software stack are involved. Nevertheless, massive efforts have been devoted to seeking effective performance isolation and highly efficient resource scheduling. The goal of this paper is to systematically cover related aspects, present the techniques from the coordination perspective, and identify the trends they indicate. Briefly, four topics are involved. First, isolation mechanisms deployed at different levels (micro-architecture, system, and virtualization levels) are reviewed, including GPU multitasking methods. Second, resource scheduling techniques within an individual machine and at the cluster level are investigated. In particular, GPU scheduling for deep learning applications is described in detail. Third, adaptive resource management, including the latest microservice-related research, is thoroughly explored. Finally, future research directions are discussed in the light of advanced work. We hope that this review paper will help researchers establish a global view of the landscape of resource management techniques in shared clouds and see technology trends more clearly.
Abstract: The possibility of carrying out a purely heterogeneous Heck reaction in practice without Pd leaching has been considered by a number of research groups, but no consensus has yet been reached. Here, the reaction was, for the first time, evaluated by a simple computational approach. Modelling experiments were performed on one of the initial catalytic steps: attachment of phenyl halides at the Pd (111)-to-(100) and (111)-to-(111) ridges of a Pd crystal. Three resulting surface structures were identified as possible reactive intermediates. The relative stabilities of these surface species were then determined by potential energy minimisation calculations based on a universal force field. The results showed the most stable species to be one in which a Pd ridge atom is removed from the Pd crystal structure, suggesting that Pd leaching induced by phenyl halides is energetically favourable.
Funding: supported by the National Natural Science Foundation of China (No. 61741102, No. 61471164) and the China Scholarship Council.
Abstract: The problem of joint radio and cloud resource allocation is studied for heterogeneous mobile cloud computing networks. The objective of the proposed joint resource allocation schemes is to maximize the total utility of users while satisfying the required quality of service (QoS), such as the end-to-end response latency experienced by each user. We formulate joint resource allocation as a combinatorial optimization problem. Three evolutionary approaches are considered to solve the problem: genetic algorithm (GA), ant colony optimization with genetic algorithm (ACO-GA), and quantum genetic algorithm (QGA). To decrease the time complexity, we propose a mapping process between the resource allocation matrix and the chromosome of GA, ACO-GA, and QGA, search the available radio and cloud resource pairs based on the resource availability matrices for ACO-GA, and encode the difference between the allocated resources and the minimum resource requirement for QGA. Extensive simulation results show that the proposed methods greatly outperform the existing algorithms in terms of running time, accuracy of the final results, total utility, resource utilization, and end-to-end response latency guarantees.
Funding: Projects (61170049, 60903044) supported by the National Natural Science Foundation of China; Project (2012AA010903) supported by the National High Technology Research and Development Program of China.
Abstract: The particle-in-cell (PIC) method has benefited greatly from GPU-accelerated heterogeneous systems. However, the performance of PIC is constrained by the interpolation operations in the weighting process on the GPU (graphics processing unit). To address this problem, a fast weighting method for PIC simulation on GPU-accelerated systems was proposed that avoids atomic memory operations during the weighting process. The method was implemented by taking advantage of the GPU's thread synchronization mechanism and dividing the problem space properly. Moreover, software-managed shared memory on the GPU was employed to buffer the intermediate data. The experimental results show that the method achieves speedups of up to 3.5 times compared with previous work, and runs 20.08 times faster on one NVIDIA Tesla M2090 GPU than on a single core of an Intel Xeon X5670 CPU.
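To illustrate why the weighting step invites atomic operations and how a restructured loop can avoid them, here is a minimal 1D sketch. It is not the authors' GPU method: it swaps in a generic scatter-to-gather reformulation with linear (cloud-in-cell) weights, and the function name `deposit_gather` and its parameters are hypothetical. Each grid node is written by exactly one loop iteration, which is the property a conflict-free parallel version would rely on.

```python
def deposit_gather(positions, charges, n_cells, dx=1.0):
    """Deposit particle charge onto n_cells + 1 grid nodes with linear weights.

    Assumes every position lies in [0, n_cells * dx). Illustrative sketch only.
    """
    # Step 1: bin particles by cell, storing the fractional offset inside the cell.
    cells = [[] for _ in range(n_cells)]
    for x, q in zip(positions, charges):
        c = int(x // dx)
        cells[c].append((x / dx - c, q))

    # Step 2: gather. Each node sums contributions from its two adjacent cells,
    # so no two iterations ever write to the same node (no atomics required).
    rho = [0.0] * (n_cells + 1)
    for node in range(n_cells + 1):
        total = 0.0
        if node < n_cells:            # cell to the right: node is its left corner
            for frac, q in cells[node]:
                total += q * (1.0 - frac)
        if node > 0:                  # cell to the left: node is its right corner
            for frac, q in cells[node - 1]:
                total += q * frac
        rho[node] = total
    return rho


rho = deposit_gather([0.25, 1.5], [1.0, 2.0], n_cells=2)
```

In a scatter formulation, the two particles above would both add to node 1, which is where concurrent threads would collide; in the gather form that sum happens inside a single iteration.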
Funding: Supported by the National Natural Science Foundation of China (No. 11875036).
Abstract: Monte Carlo (MC) simulation is regarded as the gold standard for dose calculation in brachytherapy, but it consumes a large amount of computing resources. The development of heterogeneous computing makes it possible to substantially accelerate calculations with hardware accelerators. Accordingly, this study develops a fast MC tool, called THUBrachy, which can be accelerated by several types of hardware accelerators. THUBrachy can simulate photons with energies below 3 MeV and considers all photon interactions in that energy range. It was benchmarked against the American Association of Physicists in Medicine Task Group No. 43 Report using a water phantom and validated against Geant4 using a clinical case. A performance test conducted on the clinical case showed that a multicore central processing unit, Intel Xeon Phi, and graphics processing unit (GPU) can each efficiently accelerate the simulation. GPU-accelerated THUBrachy is the fastest version: 200 times faster than the serial version and approximately 500 times faster than Geant4. The proposed tool shows great potential for fast and accurate dose calculations in clinical applications.
Funding: This work is supported by the Beijing Natural Science Foundation [4192007], the National Natural Science Foundation of China [61202076], and Beijing University of Technology Project No. 2021C02.
Abstract: In recent years, with the development of processor architecture, heterogeneous processors that combine a central processing unit (CPU) and a graphics processing unit (GPU) have become mainstream. However, because heterogeneous cores differ from one another, heterogeneous systems face many problems that remain to be solved. To address them, this paper focuses on the utilization and efficiency of heterogeneous cores and designs resource scheduling strategies accordingly. To improve system performance, it proposes a combination strategy for a single task and a multi-task scheduling strategy for multiple tasks. The combination strategy consists of two sub-strategies. The first improves the execution efficiency of tasks on the GPU by changing the thread organization structure. The second focuses on the working state of the efficient core and develops more reasonable workload balancing schemes to improve the resource utilization of heterogeneous systems. The multi-task scheduling strategy obtains the execution efficiency of the heterogeneous cores and global task information by processing task samples. Based on this information, an improved ant colony algorithm is used to quickly obtain a reasonable task allocation scheme that fully exploits the characteristics of the heterogeneous cores. The experimental results show that the combination strategy reduces task execution time by 29.13% on average. When processing multiple tasks, the multi-task scheduling strategy reduces the execution time by up to a further 23.38% relative to the combination strategy. Both strategies make better use of the resources of heterogeneous systems and significantly reduce task execution time on them.
Funding: Supported by the National Science and Technology Council (NSTC) of Taiwan under Grants 108-2218-E-033-008-MY3, 110-2634-F-A49-005, and 111-2221-E-033-033, and by the Veterans General Hospitals and University System of Taiwan Joint Research Program under Grant VGHUST111-G6-5-1.
Abstract: Federated learning is an emerging machine learning technique that enables clients to collaboratively train a deep learning model without uploading raw data to the aggregation server. Each client may be equipped with different computing resources for model training. A client with lower computing capability requires more time for model training, which prolongs the overall training time in federated learning. Moreover, it may fail to train the entire model because of out-of-memory issues. This study aims to tackle these problems and proposes the federated feature concatenate (FedFC) method for federated learning with heterogeneous clients. FedFC leverages model splitting and feature concatenation to offload a portion of the training load from clients to the aggregation server. Each client in FedFC can collaboratively train a model with a different cutting layer. Therefore, the specific features learned in the deeper layers of the server-side model are more identical for data class classification. Accordingly, FedFC can reduce the computation load on resource-constrained clients and shorten the convergence time. Its effectiveness is verified experimentally under different dataset scenarios, such as data and class imbalance among the participating clients. The performance impact of different cutting layers is evaluated during model training. The experimental results show that the co-adapted features have a critical impact on the classification performance of the deep learning model. Overall, FedFC not only shortens the convergence time, but also improves the best accuracy by up to 5.9% and 14.5% compared with conventional federated learning and splitfed, respectively. In conclusion, the proposed approach is feasible and effective for heterogeneous clients in federated learning.
Funding: Project supported by the National Natural Science Foundation of China (No. 60703012), the National Basic Research Program (973) of China (No. 2006CB303000), and the Heilongjiang Provincial Scientific and Technological Special Fund for Young Scholars (No. QC06C033), China.
Abstract: A heterogeneous computing (HC) environment utilizes diverse resources with different computational capabilities to solve computing-intensive applications that have diverse computational requirements and constraints. The task assignment problem in an HC environment can be formally defined as follows: for a given set of tasks and machines, assign tasks to machines so as to achieve the minimum makespan. In this paper we propose a new task scheduling heuristic, high standard deviation first (HSTDF), which uses the standard deviation of the expected execution time of a task as a selection criterion. This standard deviation represents the amount of variation in the task's execution time across different machines. Our conclusion is that tasks with a high standard deviation must be assigned first. A large number of experiments were carried out to check the effectiveness of the proposed heuristic in different scenarios, and the comparison with existing heuristics (Max-min, Sufferage, Segmented Min-average, Segmented Min-min, and Segmented Max-min) clearly reveals that the proposed heuristic outperforms all of them in terms of average makespan.
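The selection criterion described above can be sketched in a few lines. The abstract states only that tasks with a high standard deviation are assigned first; the machine-selection rule used here (earliest completion time) and the names `hstdf_schedule` and `etc` are assumptions made for illustration, not details taken from the paper.

```python
from statistics import pstdev


def hstdf_schedule(etc):
    """Order tasks by the standard deviation of their expected execution times.

    etc[t][m] is the expected execution time of task t on machine m.
    Returns (assignment, makespan). The earliest-completion-time placement
    rule is an assumption; the abstract does not specify it.
    """
    n_machines = len(etc[0])
    ready = [0.0] * n_machines          # time at which each machine becomes free
    assignment = {}

    # Tasks whose execution time varies most across machines go first.
    order = sorted(range(len(etc)), key=lambda t: pstdev(etc[t]), reverse=True)

    for t in order:
        # Place the task on the machine that would finish it earliest.
        m = min(range(n_machines), key=lambda j: ready[j] + etc[t][j])
        ready[m] += etc[t][m]
        assignment[t] = m

    return assignment, max(ready)


assignment, makespan = hstdf_schedule([[5, 5], [1, 10], [8, 2]])
```

In this toy matrix, task 1 (times 1 and 10) has the largest spread and is placed first, which secures its fast machine before the insensitive task 0 (times 5 and 5) can occupy it.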
Abstract: Task scheduling determines the performance of network of workstations (NOW) computing to a large extent. However, the computer system architecture, computing capability, and system load are rarely considered together. In this paper, a biggest heterogeneous scheduling algorithm is presented. It fully considers the system characteristics (from the application's point of view), structure, and state, so it can always utilize all processing resources under a reasonable premise. Experimental results show that the algorithm can significantly shorten the response time of jobs.
Abstract: Molecular Dynamics (MD) simulation for computing the Interatomic Potential (IAP) is a very important High-Performance Computing (HPC) application. MD simulation of particles of experimental relevance takes a huge amount of computation time, even on an expensive high-end server. Heterogeneous computing, a combination of a Field Programmable Gate Array (FPGA) and a computer, is proposed as a solution for computing MD simulations efficiently. Such heterogeneous computation requires communication between the FPGA and the computer. One such MD simulation, explained in the paper, is the Artificial Neural Network (ANN)-based IAP computation of gold (Au147 and Au309) nanoparticles. MD simulation calculates the forces between atoms and the total energy of the chemical system. This work proposes a novel design and implementation of an ANN IAP-based MD simulation for Au147 and Au309 that uses communication protocols, namely Universal Asynchronous Receiver-Transmitter (UART) and Ethernet, between the FPGA and the host computer. To improve the latency of MD simulation through heterogeneous computing, both protocols were explored in MD simulations of 50,000 cycles. In this study, computation times of 17.54 and 18.70 h were achieved with UART and Ethernet, respectively, compared with a conventional server time of 29 h for Au147 nanoparticles. The results pave the way for the development of a Lab-on-a-chip application.
Abstract: Graphics Processing Units (GPUs) are used to accelerate computing-intensive tasks such as neural networks, data analysis, and high-performance computing. Over the past decade or so, researchers have done a great deal of work on GPU architecture and proposed a variety of theories and methods for studying the microarchitectural characteristics of various GPUs. In this study, the GPU serves as a co-processor and works together with the CPU in an embedded real-time system to handle computationally intensive tasks. The study models the GPU architecture, building on previous work, and analyzes it in more detail by considering the SIMT mechanism and cache misses. To verify the proposed GPU architecture model, experiments were performed with 10 GPU kernel tasks on an Nvidia GPU device. The experimental results showed that the error between the kernel task execution time predicted by the proposed model and the measured kernel task execution time ranged from a minimum of 3.80% to a maximum of 8.30%.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 21978295, 22078330, 92034302 and 91834303); the Innovation Academy for Green Manufacture, Chinese Academy of Sciences (Grant Nos. IAGM-2019-A03 and IAGM-2019-A13); the Key Research Program of Frontier Sciences, Chinese Academy of Sciences (Grant No. QYZDJ-SSWJSC029); the "Transformational Technologies for Clean Energy and Demonstration" Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA21030700); and the Youth Innovation Promotion Association, Chinese Academy of Sciences (Grant No. 2019050).
Abstract: Most natural resources are processed as particle-fluid multiphase systems in the chemical, mineral, and material industries; discrete particle methods (DPM) are therefore reasonable choices of simulation method for engineering the relevant processes and equipment. However, direct application of these methods is challenged by the complex multiscale behavior of such systems, which leads either to enormous computational cost or to a qualitatively inaccurate description of the mesoscale structures. The coarse-grained DPM based on the energy-minimization multi-scale (EMMS) model, or EMMS-DPM, was proposed to reduce the computational cost by several orders of magnitude while maintaining an accurate description of the mesoscale structures, which paves the way for its engineering applications. Further empowered by the high-efficiency multi-scale DEM software DEMms and the corresponding customized heterogeneous supercomputing facilities with graphics processing units (GPUs), it may even approach real-time simulation of industrial reactors. This short review introduces the principle of DPM, in particular EMMS-DPM, and the recent developments in modeling, numerical implementation, and application of large-scale DPM, which aims to reach industrial scale on the one hand and to resolve the mesoscale structures critical to reaction-transport coupling on the other. The review closes with prospects for the future development of DPM in this direction.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 11772131, 11772132, 11772134 & 11472109); the Natural Science Foundation of Guangdong Province, China (Grant Nos. 2015A030308017, 2015A030311046 & 2015B010131009); the Opening Fund of the State Key Laboratory of Nonlinear Mechanics (LNM), CAS; and the State Key Lab of Subtropical Building Science, South China University of Technology (Grant Nos. 2014ZC17 & 2017ZD096).
Abstract: Parallel computing techniques have been introduced into digital image correlation (DIC) in recent years and have led to a surge in computation speed. Graphics processing unit (GPU)-based parallel computing has proved surprisingly effective at accelerating iterative subpixel DIC compared with CPU-based parallel computing. In this paper, the performance of the two kinds of parallel computing techniques is compared for the previously proposed path-independent DIC method, in which the initial guess for the inverse compositional Gauss-Newton (IC-GN) algorithm at each point of interest (POI) is estimated through the fast Fourier transform-based cross-correlation (FFT-CC) algorithm. Based on the performance evaluation, a heterogeneous parallel computing (HPC) model with a hybrid mode of parallelism is proposed in order to combine the computing power of the GPU and the multicore CPU. A trial computation scheme is developed to optimize the configuration of the HPC model on a specific computer. The proposed HPC model shows excellent performance on a mid-range desktop computer for real-time subpixel DIC at a high resolution of more than 10000 POIs per frame.
Funding: Project supported by the National Key Research and Development Program of China (No. 2021YFB0300101), the National Natural Science Foundation of China (No. 61972408), and the UK Royal Society International Collaboration Grant.
Abstract: As the hardware industry moves toward using specialized heterogeneous many-core processors to avoid the effects of the power wall, software developers are finding it hard to deal with the complexity of these systems. In this paper, we share our experience of developing a programming model and its supporting compiler and libraries for Matrix-3000, which is designed for next-generation exascale supercomputers but has a complex memory hierarchy and processor organization. To assist its software development, we have developed a software stack from scratch that includes a low-level programming interface and a high-level OpenCL compiler. Our low-level programming model offers native programming support for using the bare-metal accelerators of Matrix-3000, while the high-level model allows programmers to use the OpenCL programming standard. We detail our design choices and highlight the lessons learned from developing system software to enable the programming of bare-metal accelerators. Our programming models have been deployed in the production environment of an exascale prototype system.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 61103223) and the Natural Science Foundation of Jiangsu Province (No. BK2011003).
Abstract: To reduce the running time of network simulation in a heterogeneous computing environment, a network simulation task partition method, named LBPHCE, is put forward. In this method, the network simulation task is partitioned with comprehensive consideration of the load balance of both routing computing simulation and packet forwarding simulation. First, the computation ability and routing simulation ability of each simulation machine in the heterogeneous computing environment are measured through benchmark experiments. Second, based on the computation ability of each simulation machine, the network simulation task is initially partitioned to balance the load of packet forwarding simulation; then, according to the routing computation ability, the scale of each partition is fine-tuned to balance the routing computing simulation while the load balance of packet forwarding simulation is preserved. Experiments based on PDNS indicate that, compared with the traditional uniform partition method, LBPHCE reduces the total simulation running time by 26.3% on average, and compared with the linear partition method, it reduces the running time by 18.3% on average.
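The initial partitioning step described above, in which partition sizes follow each machine's measured computation ability, can be sketched as a proportional split. This covers only that first step; the routing-based fine-tuning is not specified in the abstract and is omitted. The function name `proportional_partition` and the largest-remainder rounding rule are assumptions made for illustration.

```python
def proportional_partition(total_nodes, abilities):
    """Split total_nodes simulated network nodes across machines.

    abilities[i] is a benchmark score for machine i (higher means faster).
    Sizes are proportional to ability; leftover nodes from rounding down go
    to the machines with the largest fractional remainders (an assumption).
    """
    total_ability = sum(abilities)
    quotas = [total_nodes * a / total_ability for a in abilities]
    sizes = [int(q) for q in quotas]                 # round every quota down
    leftover = total_nodes - sum(sizes)

    # Hand out the remaining nodes, largest fractional part first.
    by_remainder = sorted(
        range(len(abilities)), key=lambda i: quotas[i] - sizes[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes


sizes = proportional_partition(100, [1.0, 2.0, 1.0])
```

A uniform partition would give each of the three machines about 33 nodes regardless of speed; the proportional split gives the machine that benchmarks twice as fast twice the share.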
Funding: Supported by the National Natural Science Foundation of China (Nos. 61170049 and 60903044) and the National High-Tech Research and Development (863) Program of China (Nos. 2012AA01A301 and 2012AA010903).
Abstract: Heterogeneous systems with both Central Processing Units (CPUs) and Graphics Processing Units (GPUs) are frequently used to accelerate short-ranged Molecular Dynamics (MD) simulations. The most time-consuming task in short-ranged MD simulations is the computation of particle-to-particle interactions. Beyond a certain distance, these interactions decrease to zero. To minimize the number of distance checks, previous works have tiled interactions by employing the spatial attribute, which increases memory accesses and GPU computations and hence decreases performance. Other studies ignore the spatial attribute and construct an all-versus-all interaction matrix, which scales poorly. This paper presents an improved algorithm. The algorithm first bins particles into voxels according to their spatial attributes, and then tiles the all-versus-all matrix into voxel-versus-voxel sub-matrices. Only the sub-matrices between neighboring voxels are computed on the GPU. The algorithm therefore reduces the number of distance checks and limits additional memory accesses and GPU computations. This paper also adopts a multi-level programming model to implement the algorithm on multiple nodes of Tianhe-1A. By employing (1) a patch design to exploit parallelism across the simulation domain, (2) a communication overlapping method to overlap the communications between CPUs and GPUs, and (3) a dynamic workload balancing method to adjust the workloads among compute nodes, the implementation achieves a speedup of 4.16x on one NVIDIA Tesla M2050 GPU compared with a 2.93 GHz six-core Intel Xeon X5670 CPU. In addition, it runs 2.41x faster on 256 compute nodes of Tianhe-1A (with two CPUs and one GPU per node) than on 256 GPU-excluded nodes.
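The binning idea described above can be illustrated on the CPU with a small sketch. It is a generic cell-list construction rather than the paper's GPU implementation: the voxel edge is assumed to equal the cutoff distance, and the names `bin_particles` and `neighbor_pairs` are hypothetical. Only voxel-versus-voxel blocks between neighboring voxels are examined, instead of the full all-versus-all matrix.

```python
from collections import defaultdict
from itertools import product


def bin_particles(positions, cutoff):
    """Map each particle index to the cubic voxel (edge = cutoff) containing it."""
    voxels = defaultdict(list)
    for i, (x, y, z) in enumerate(positions):
        key = (int(x // cutoff), int(y // cutoff), int(z // cutoff))
        voxels[key].append(i)
    return voxels


def neighbor_pairs(positions, cutoff):
    """Return all index pairs (i, j), i < j, closer than or equal to cutoff."""
    voxels = bin_particles(positions, cutoff)
    cutoff2 = cutoff * cutoff
    pairs = set()
    for (cx, cy, cz), members in voxels.items():
        # Visit the voxel itself and its 26 neighbors: one sub-matrix each.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            other = voxels.get((cx + dx, cy + dy, cz + dz))
            if other is None:
                continue
            for i in members:
                for j in other:
                    if i < j:
                        d2 = sum((a - b) ** 2
                                 for a, b in zip(positions[i], positions[j]))
                        if d2 <= cutoff2:
                            pairs.add((i, j))
    return pairs


points = [(0.1, 0.1, 0.1), (0.9, 0.1, 0.1), (1.4, 0.1, 0.1), (5.0, 5.0, 5.0)]
pairs = neighbor_pairs(points, cutoff=1.0)
```

Because the voxel edge equals the cutoff, any two particles within range must sit in the same or adjacent voxels, so the distant particle at (5, 5, 5) is never compared with the others.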
Funding: This work was supported by the MEyC under Grant No. TIN 2004-03388.
Abstract: Implicit coscheduling techniques applied to non-dedicated homogeneous Networks Of Workstations (NOWs) have been shown to perform well when many local users compete with a single parallel job. Implicit coscheduling minimizes the communication waiting time of parallel processes by identifying the processes in need of coscheduling through gathering and analyzing implicit runtime information, basically communication events. Unfortunately, implicit coscheduling techniques do not guarantee the performance of local and parallel jobs when the number of parallel jobs competing against each other increases. As a result, the idle computational resources are used inefficiently. To solve these problems, a new technique, named Cooperating CoScheduling (CCS), is presented in this work. Unlike traditional implicit coscheduling techniques, under CCS each node bases its scheduling decisions on the occurrence of local events (basically communication, memory, Input/Output, and CPU events) together with foreign events received from cooperating nodes. This allows CCS to provide a social contract based on reserving a percentage of CPU and memory resources to ensure the progress of parallel jobs without disturbing the local users, while coscheduling of communicating tasks is ensured. In addition, the CCS algorithm uses status information from the cooperating nodes to balance resources across the cluster when necessary. Experimental results in a non-dedicated heterogeneous NOW reveal that CCS allows the idle resources to be exploited efficiently, obtaining a satisfactory speedup with an overhead that is imperceptible to the local user.
Funding: The authors acknowledge the assistance provided by Mr. Xiaoqiang Yue and Mr. Zheng Li from Xiangtan University with the numerical experiments. Feng is partially supported by NSFC Grant 11201398, the Program for Changjiang Scholars and Innovative Research Team in University of China Grant IRT1179, and the Specialized Research Fund for the Doctoral Program of Higher Education of China Grant 20124301110003. Shu is partially supported by NSFC Grants 91130002 and 11171281 and by the Scientific Research Fund of the Hunan Provincial Education Department of China Grant 12A138. Xu is partially supported by NSFC Grant 91130011 and NSF DMS-1217142. Zhang is partially supported by the Dean Startup Fund, Academy of Mathematics and System Sciences, and by NSFC Grant 91130011.
Abstract: The geometric multigrid method (GMG) is one of the most efficient solving techniques for discrete algebraic systems arising from elliptic partial differential equations. GMG utilizes a hierarchy of grids or discretizations and reduces the error at a number of frequencies simultaneously. Graphics processing units (GPUs) have recently burst onto the scientific computing scene as a technology that has yielded substantial performance and energy-efficiency improvements. A central challenge in implementing GMG on GPUs, though, is that computational work on coarse levels cannot fully utilize the capacity of a GPU. In this work, we perform numerical studies of GMG on CPU-GPU heterogeneous computers. Furthermore, we compare our implementation with an efficient CPU implementation of GMG and with the most popular fast Poisson solver, the Fast Fourier Transform, in the cuFFT library developed by NVIDIA.
Funding: This work is supported by the National Natural Science Foundation of China (No. 61672358).
Abstract: The widespread application of heterogeneous cloud computing has enabled enormous advances in the real-time performance of telehealth systems. A cloud-based telehealth system allows healthcare users to obtain medical data from various data sources supported by heterogeneous cloud providers. Employing data duplications in distributed cloud databases is an alternative approach for achieving data sharing among multiple data users. However, this approach consumes additional storage space, while reducing data duplications would in turn decrease data acquisition and real-time performance. To address this issue, this paper develops a dynamic data deduplication method that uses an intelligent blocker to determine the working mode of data duplications for each data package in heterogeneous cloud-based telehealth systems. The proposed approach is named SD2M (Smart Data Deduplication Model); its main algorithm applies dynamic programming to produce optimal solutions that minimize the total cost of data usage. We carry out experimental evaluations to examine the adaptability of the proposed approach.
Abstract: With computing systems having undergone a fundamental transformation from the single-processor devices at the turn of the century to ubiquitous networked devices and warehouse-scale computing via the cloud, parallelism has become ubiquitous at many levels. At the micro level, parallelism is being explored from the underlying circuits, to pipelining and instruction-level parallelism on multi-core or many-core chips as well as within a machine. At the macro level, parallelism is being promoted from multiple machines on a rack and many racks in a data center to the globally shared infrastructure of the Internet. With the push of big data, we are entering a new era of parallel computing driven by novel and groundbreaking research innovation on elastic parallelism and scalability. In this paper, we give an overview of computing infrastructure for big data processing, focusing on the architectural, storage, and networking challenges of supporting big data. We briefly discuss emerging computing infrastructure and technologies that are promising for improving data parallelism and task parallelism and for encouraging vertical and horizontal computation parallelism.
Funding: Project supported by the National Key R&D Program, China (No. 2016YFB1000204).
Abstract: The management of resource contention in shared clouds remains an open problem. The evolution and deployment of new application paradigms (e.g., deep learning training and microservices) and custom hardware (e.g., graphics processing units (GPUs) and tensor processing units (TPUs)) have posed new challenges for resource management system design. Current solutions tend to trade cluster efficiency for guaranteed application performance, e.g., through resource over-allocation, leaving many resources underutilized. Overcoming this dilemma is not easy, because different components across the software stack are involved. Nevertheless, massive efforts have been devoted to seeking effective performance isolation and highly efficient resource scheduling. The goal of this paper is to systematically cover the related aspects, present the techniques from the coordination perspective, and identify the trends they indicate. Four topics are covered. First, isolation mechanisms deployed at different levels (micro-architecture, system, and virtualization levels) are reviewed, including GPU multitasking methods. Second, resource scheduling techniques within an individual machine and at the cluster level are investigated; in particular, GPU scheduling for deep learning applications is described in detail. Third, adaptive resource management, including the latest microservice-related research, is thoroughly explored. Finally, future research directions are discussed in the light of advanced work. We hope that this review will help researchers establish a global view of the landscape of resource management techniques in shared clouds and see technology trends more clearly.