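CMAQ (Community Multiscale Air Quality) involves massive spatial data, complex processing models, and strict time requirements. Its compute-intensive operations make serial CMAQ a computational bottleneck, and expensive dedicated high-performance supercomputers are out of reach for ordinary researchers, so parallel CMAQ on a Linux cluster is an important way to address the problem. Taking the open-source CMAQ as the object of study, this paper discusses the computing mode, architecture, parallel patterns, and software framework of a Linux-cluster-based parallel CMAQ, and builds a corresponding prototype system. Experiments show that the proposed parallel architecture is significantly more computationally efficient than the traditional serial architecture.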
Abstract: Large-scale computations are common in science and engineering fields such as numerical weather forecasting, astrophysics, energy resource exploration, nuclear weapon design, and plasma fusion research. Many applications in these areas need supercomputing power. Traditional sequential processing cannot meet the demands of these computations, so parallel processing (PP) is now the main approach to high-performance computing (HPC).
Abstract: A new file assignment strategy for parallel I/O on cluster computing systems, the heuristic file sorted assignment algorithm, is proposed. To balance load, it assigns files with similar service times to the same disk. First, the files are sorted in descending order of service time and stored in a set I. Then, when files are to be assigned, one disk of a cluster node is selected at random. Finally, consecutive files are taken in order from set I and placed on that disk until the disk reaches its maximum load. Experimental results show that the new strategy improves performance by 20.2% when the system load is light and by 31.6% when the load is heavy. The higher the data access rate, the more evident the performance improvement obtained by the heuristic file sorted assignment algorithm.
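The assignment steps described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sorted_file_assignment`, the per-disk `max_load` threshold, and the rule of moving on to another randomly chosen disk once the current one is full are assumptions made for the example.

```python
import random


def sorted_file_assignment(service_times, num_disks, max_load, seed=None):
    """Assign files to disks so that files with similar service times share a disk.

    service_times: dict mapping file name -> expected service time (hypothetical input).
    max_load: assumed per-disk load limit, in the same units as the service times.
    Returns a dict mapping disk index -> list of file names.
    """
    rng = random.Random(seed)
    # Step 1: sort the files in descending order of service time (the set I).
    ordered = sorted(service_times, key=service_times.get, reverse=True)

    assignment = {d: [] for d in range(num_disks)}
    load = {d: 0.0 for d in range(num_disks)}
    free_disks = list(range(num_disks))
    rng.shuffle(free_disks)  # Step 2: disks are picked in random order.

    disk = free_disks.pop()
    for name in ordered:
        # Step 3: keep filling the current disk with consecutive files until
        # the next file would exceed its maximum load. A file larger than
        # max_load is still placed alone on an empty disk.
        if load[disk] + service_times[name] > max_load and assignment[disk]:
            if not free_disks:
                raise ValueError("total load exceeds the capacity of all disks")
            disk = free_disks.pop()
        assignment[disk].append(name)
        load[disk] += service_times[name]
    return assignment
```

For example, with service times `{"a": 9.0, "b": 8.5, "c": 4.0, "d": 3.5, "e": 1.0}`, three disks, and `max_load=18.0`, the two longest-running files end up together on one disk and the three shorter ones on another; which disk index each group lands on depends on the random choice.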
Funding: Project supported by the Key Project Science Foundation of Shanghai Municipal Commission of Education (Grant No. 03AZ03)
Abstract: A parallel finite element method using the domain decomposition technique is adapted to the distributed parallel environment of a workstation cluster. An algorithm is presented for parallelizing the preconditioned conjugate gradient method based on domain decomposition. Using the developed code, a dam structural analysis problem is solved on a workstation cluster, results are given, and the parallel performance is analyzed.
Funding: Project supported by the National Natural Science Foundation of China (10372114) and the Engineering and Physical Sciences Research Council (EPSRC) of the UK (GR/R21219)
Abstract: A computational strategy is presented for the nonlinear dynamic analysis of large-scale combined finite/discrete element systems on a PC cluster. In this strategy, a dual-level domain decomposition scheme is adopted to implement dynamic domain decomposition. The domain decomposition approach matches the requirement of reducing the memory size per processor of the calculation. To treat the contact between boundary elements in neighbouring subdomains, the elements in a subdomain are classified into internal, interfacial, and external elements. In this way, all the contact detection algorithms developed for sequential computation can be adopted directly in the parallel computation. Two typical numerical examples demonstrate the parallel efficiency and scalability on a PC cluster and show that this implementation is suitable for simulating large-scale problems.
Funding: National Science Foundation of China (No. 60173031)
Abstract: This paper presents the idea of replacing traditionally expensive parallel machines with a heterogeneous cluster of workstations. To emphasise the usability of the cluster-of-workstations platform for parallel and distributed computing, the paper also reports on the effort and experience of implementing dynamic load balancing for parallel tree computation with depth-first search (DFS) in the cluster of workstations project. It compares the speedup obtained on this platform with that obtained on a traditional parallel machine. The speedup results show that a cluster of workstations can be a serious alternative to expensive parallel machines.
Abstract: The rapid growth of interconnected high-performance workstations has produced a new computing paradigm called cluster of workstations computing. In these systems, load imbalance is a serious impediment to achieving good performance. The main concern of this paper is the implementation of a dynamic load balancing algorithm, asynchronous round robin (ARR), for balancing the workload of a parallel depth-first search tree computation on a cluster of heterogeneous workstations (COW). Many algorithms in artificial intelligence and other areas of computer science are based on depth-first search in implicitly defined trees. These algorithms require a load-balancing scheme that can evenly distribute parts of an irregularly shaped tree over the workstations with minimal interprocessor communication and without prior knowledge of the tree's shape. The ARR algorithm communicates between processors only when necessary, and it runs under MPI (Message Passing Interface), which allows parallel execution on a heterogeneous cluster of Sun workstations. The program is written in C and executed under the UNIX operating system (Solaris).
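In the asynchronous round robin scheme as it is commonly described in the load-balancing literature, each idle processor picks the processor to ask for work from its own local counter, which it advances independently of all other processors. The sketch below shows only that donor-selection rule. It is a single-process illustration with hypothetical names (`ArrSelector`, `next_donor`), not the MPI implementation described in the abstract, and the rule of skipping a processor's own rank is an assumption of the example.

```python
class ArrSelector:
    """Donor selection for asynchronous round robin (ARR) load balancing."""

    def __init__(self, rank, num_procs):
        if num_procs < 2:
            raise ValueError("ARR needs at least two processors")
        self.rank = rank
        self.num_procs = num_procs
        # Each processor starts by targeting its right-hand neighbour.
        self.target = (rank + 1) % num_procs

    def next_donor(self):
        """Return the rank to ask for work, then advance the local counter."""
        donor = self.target
        self.target = (self.target + 1) % self.num_procs
        if self.target == self.rank:
            # Assumed rule: never target ourselves, so skip our own rank.
            self.target = (self.target + 1) % self.num_procs
        return donor
```

Because the counter is local, no shared state has to be updated when a processor runs out of work, which is why the scheme needs no communication beyond the work request itself.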
Funding: This work is supported by the Fujian Province Natural Science Foundation (No. 2008J0180) and the Scientific Research Start Foundation of Fujian University of Technology (No. GY-Z0707).
Funding: the National Natural Science Foundation of China (69603005), and the Science Foundation of Shanghai Municipal Commission of Sc
Abstract: Using commodity SMPs (shared memory processors) to build cluster-based supercomputers has become a mainstream trend. Yet programming this kind of supercomputer requires an environment that supports both message passing and shared memory programming. This paper describes our preliminary work on targeting a BSP library at clusters of SMPs. To exploit the maximum performance potential of a cluster of SMPs, we adopt threads to reduce system overhead and to exploit the capacity of the SMPs. A fore-layer synchronization mechanism is proposed to support barrier synchronization within an SMP node, within a group of SMP nodes, and across the whole cluster. Our BSP library is compared with currently available BSP libraries such as PUB.
Funding: Supported by the National Basic Research Program (973) of China (No. 2014CB340303), the National Natural Science Foundation of China (Nos. 61502509 and 61222205), the Program for New Century Excellent Talents in University, and the Fok Ying-Tong Education Foundation (No. 141066)
Abstract: The density peak (DP) algorithm has been widely used in scientific research because of its novel and effective density-peak-based clustering approach. However, the DP algorithm uses each pair of data points several times when determining cluster centers, which yields high computational complexity. In this paper, we focus on accelerating the time-consuming density peaks algorithm with a graphics processing unit (GPU). We analyze the principle of the algorithm to locate its computational bottlenecks and evaluate its potential for parallelism. In light of this analysis, we propose an efficient parallel DP algorithm targeting a GPU architecture and implement it with compute unified device architecture (CUDA) as the 'CUDA-DP platform'. Specifically, we use shared memory to improve data locality, which reduces the number of global memory accesses. To exploit the coalesced memory access mechanism of the GPU, we convert the data structure of the CUDA-DP program from an array of structures to a structure of arrays. In addition, we introduce a binary search-and-sampling method to avoid sorting a large array. Experimental results show that CUDA-DP achieves a 45-fold speedup over the CPU-based density peaks implementation.
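The array-of-structures to structure-of-arrays conversion mentioned in this abstract can be illustrated on the host side. The sketch below is plain Python rather than CUDA and uses a hypothetical point record with fields `x`, `y`, and `density`; it only shows the change of layout, not the authors' CUDA-DP data structures. In the structure-of-arrays form each field is stored contiguously, which is what lets neighbouring GPU threads read neighbouring addresses.

```python
from array import array


def aos_to_soa(points):
    """Convert a list of (x, y, density) records into one contiguous array per field."""
    soa = {"x": array("f"), "y": array("f"), "density": array("f")}
    for x, y, density in points:
        soa["x"].append(x)
        soa["y"].append(y)
        soa["density"].append(density)
    return soa


# Array of structures: the fields of one point sit next to each other.
aos = [(0.0, 1.0, 3.0), (2.0, 5.0, 1.0), (4.0, 0.5, 2.0)]
# Structure of arrays: all x values are contiguous, then all y values, and so on.
soa = aos_to_soa(aos)
```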
Funding: This work was supported by the Fundamental Research Funds for the Central Universities (Grant No. ZYGX2009J073), the National Natural Science Foundation of China (Grant Nos. 41001221 and 40701146), and the National High-Tech R&D Program of China (Grant No. 2007AA12Z227). Thanks to Anthony Lewis, the Letters Editor of IJDE, and the anonymous reviewers for their positive and constructive suggestions to improve the paper.
Abstract: In response to the problem of how to give geographic information systems (GIS) high-performance capabilities for certain specific GIS applications, a new GIS research direction, parallel GIS processing, has emerged. However, traditional research has focused mostly on implementing typical GIS parallel algorithms, with little discussion of how to parallelize an entire GIS package on clusters on a theoretical basis. Therefore, the authors have chosen the Geographic Resources Analysis Support System (GRASS) GIS as the object of their research and have put forward the concept of a cluster-based open-source parallel GIS (cluster-based OP-GIS) as a tool to support Digital Earth construction. The related theory includes not only the parallel computing mode, architecture, and software framework of such a system, but also various parallelization patterns. Experiments on the prototype system indicate that the parallel system has better efficiency and performance than the conventional system on certain selected modules.
Abstract: In recent years, high-performance scientific computing on workstation clusters connected by a local area network has become a hot topic. Because latency is longer and protocol-processing overhead is higher relative to the capacity of a powerful single workstation, it is becoming critically important to balance not only the numerical load but also the communication load, and to overlap communication with computation during parallel computing. Hence, efficiency evaluation rules must expose these properties of a given parallel algorithm so that the existing algorithm can be optimized to attain its highest parallel efficiency. The traditional efficiency evaluation rules can no longer do this. Building on Culler's detailed discussion of interconnection networks for MPP systems in the LogP model, this paper presents a system of efficiency evaluation rules for parallel computations on workstation clusters with the PVM 3.0 parallel software framework. These rules satisfy the above requirements. Finally, two typical applications, one synchronous and one asynchronous, are designed to verify the validity of these rules on a cluster of four SGI workstations connected by Ethernet.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 50608042)
Abstract: This study establishes functions relating environmental conditions to inhabitants' preferences for multi-story row house clusters with a parallel layout, based on questionnaire data from Beijing. A program has been written using multi-agent system and generative computer simulation approaches. The emergent candidate layout plans can be referenced and chosen by architects.
Funding: Natural Science Foundation of China (No. 60173031)
Abstract: The real problem in a cluster of workstations is that changes in workstation power, in the number of workstations, or in the run-time behavior of the application hamper the efficient use of resources. Dynamic load balancing is a technique for the parallel implementation of problems that generate unpredictable workloads; it migrates work units from heavily loaded processors to lightly loaded processors at run time. This paper proposes an efficient load balancing method for parallel depth-first search (DFS) tree computations, which generate unpredictable, highly imbalanced workloads and move through different phases detectable at run time. A dynamic load balancing strategy is applied in each phase, running under MPI (Message Passing Interface) and the Unix operating system on a cluster of workstations.