Funding: This work was supported by the National Natural Science Foundation of China [grant number 41471313], [grant number 41101356], [grant number 41671391]; the Fundamental Research Funds for the Central Universities [grant number 2016XZZX004-02]; the Science and Technology Project of Zhejiang Province [grant number 2015C33021], [grant number 2013C33051]; and the Major Program of China High Resolution Earth Observation System [grant number 07-Y30B10-9001].
Abstract: Symmetry is a common feature in the real world. It can be used to improve classification by taking the point symmetry-based distance as the clustering measure. However, calculating the point symmetry-based distance is time consuming. Although an efficient parallel point symmetry-based K-means algorithm (ParSym) has been proposed to overcome this limitation, ParSym may get stuck in sub-optimal solutions because of the K-means technique it uses. In this study, we propose a novel parallel point symmetry-based genetic clustering (ParSymG) algorithm for unsupervised classification. The genetic algorithm is introduced to overcome the sub-optimization problem caused by inappropriate selection of initial centroids in ParSym. A message passing interface (MPI) is used to implement the distributed master-slave paradigm. To make the algorithm more time-efficient, a three-phase speedup strategy is adopted for population initialization, image partition, and kd-tree structure-based nearest neighbor searching. The advantages of ParSymG over the existing ParSym and parallel K-means (PKM) algorithms are demonstrated through case studies using three different types of remotely sensed images. Results in speedup and time gain demonstrate the excellent scalability of the ParSymG algorithm.
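The abstract does not define the point symmetry-based distance, so the following is only a minimal sketch under one commonly used formulation, stated here as an assumption: reflect a point about the candidate centroid, average the distances from the reflected point to its `knear` nearest data points, and weight that average by the Euclidean distance to the centroid. The function name, the `knear` parameter, and the brute-force neighbor search are hypothetical; a kd-tree, as mentioned in the abstract, would replace the linear scan.

```python
import math


def point_symmetry_distance(x, centroid, data, knear=2):
    """Sketch of a point symmetry-based distance (assumed formulation).

    x, centroid: coordinate tuples of equal length.
    data: non-empty list of coordinate tuples (the data set).
    knear: number of nearest neighbors of the reflected point to average.
    """
    # Reflect x about the centroid: x* = 2c - x.
    reflected = tuple(2 * c - xi for xi, c in zip(x, centroid))
    # Brute-force nearest-neighbor search; a kd-tree would avoid the full scan.
    nearest = sorted(math.dist(reflected, p) for p in data)[:knear]
    symmetry_term = sum(nearest) / len(nearest)
    # Weight the symmetry term by the Euclidean distance to the centroid.
    return symmetry_term * math.dist(x, centroid)
```

The nearest-neighbor search runs once per point and per candidate centroid, which is why it dominates the cost and is a natural target for a kd-tree and for parallelization.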
Funding: National Natural Science Foundation of China (No. 61070101, No. 60875029, No. 61175048)
Abstract: The density-based algorithm for discovering clusters in large spatial databases with noise (DBSCAN) is a classic density-based spatial clustering algorithm. It is widely applied because it captures clusters of arbitrary shape and detects outliers well. In practice, however, datasets are often too massive for serial DBSCAN to handle. A new parallel algorithm, Parallel DBSCAN (PDBSCAN), is proposed to solve this problem. The proposed algorithm is based on the MapReduce mechanism. Parallelism is applied to the region query and candidate queue processing steps, which require substantial computational resources. As a result, PDBSCAN scales to large datasets and is well suited to e-commerce applications, especially recommendation.
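To make the two steps named in the abstract concrete, here is a minimal serial sketch of DBSCAN showing where the region query and the candidate queue appear. It is not the PDBSCAN algorithm: the MapReduce partitioning is not described in the abstract and is not attempted here, and the function and parameter names are illustrative.

```python
import math
from collections import deque

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Serial DBSCAN; returns one cluster label per point, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = deque(neighbours)  # candidate queue of points still to expand
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
        cluster += 1
    return labels
```

Each `region_query` call scans the whole dataset, so the serial cost grows quadratically with the number of points, which is consistent with the abstract's choice of these steps as the targets for parallelization.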
Abstract: Clustering is one of the most widely used techniques for exploratory data analysis. Spectral clustering, a popular modern clustering algorithm, has been shown to be more effective in detecting clusters than many traditional algorithms. It has applications ranging from computer vision and information retrieval to social science and biology. As database sizes soar, clustering algorithms face growing computation time and memory use. In this paper, we propose a parallel spectral clustering implementation based on MapReduce. Both the computation and the data storage are distributed, which solves the scalability problems of most existing algorithms. We empirically analyze the proposed implementation on both benchmark networks and a real social network dataset of about two million vertices and two billion edges crawled from Sina Weibo. The results show that the proposed implementation scales well, speeds up clustering without sacrificing quality, and processes massive datasets efficiently on commodity machine clusters.
Funding: The project was supported by the National Natural Science Foundation of China (10372114) and the Engineering and Physical Sciences Research Council (EPSRC) of the UK (GR/R21219).
Abstract: A computational strategy is presented for the nonlinear dynamic analysis of large-scale combined finite/discrete element systems on a PC cluster. In this strategy, a dual-level domain decomposition scheme is adopted to implement dynamic domain decomposition. The domain decomposition approach meets the requirement of reducing the memory needed per processor. To treat the contact between boundary elements in neighbouring subdomains, the elements in a subdomain are classified into internal, interfacial and external elements. In this way, all the contact detection algorithms developed for sequential computation can be adopted directly in the parallel computation. Numerical examples show that this implementation is suitable for simulating large-scale problems. Two typical numerical examples are given to demonstrate the parallel efficiency and scalability on a PC cluster.
Abstract: In recent years, high-performance scientific computing on workstation clusters connected by a local area network has become a hot topic. Because network latency and protocol-processing overhead are high relative to the power of a single workstation, it is becoming critically important to balance not only the numerical load but also the communication load, and to overlap communication with computation during parallel computing. Hence, efficiency evaluation rules must reveal these capabilities of a given parallel algorithm so that an existing algorithm can be optimized to attain its highest parallel efficiency. The traditional efficiency evaluation rules can no longer do this. Building on Culler's detailed discussion of interconnection networks for MPP systems in the LogP model, this paper presents a system of efficiency evaluation rules for parallel computation on workstation clusters with the PVM 3.0 parallel software framework. These rules satisfy the above requirements. Finally, two typical applications, one synchronous and one asynchronous, are designed to verify the validity of these rules on a cluster of four SGI workstations connected by Ethernet.
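The paper's own evaluation rules are not given in the abstract. As background only, the sketch below shows the textbook definitions of speedup and parallel efficiency, together with the standard LogP estimate for one small point-to-point message (sender overhead, network latency, receiver overhead). The function names are illustrative and the numbers in the usage note are made up.

```python
def speedup(t_serial, t_parallel):
    """Classical speedup: serial run time divided by parallel run time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """Parallel efficiency: speedup divided by the number of processors."""
    return speedup(t_serial, t_parallel) / processors


def logp_point_to_point(latency, overhead):
    """LogP cost of one small message: send overhead + latency + receive overhead."""
    return latency + 2 * overhead
```

For example, a job that takes 100 s serially and 25 s on 8 processors has a speedup of 4 and an efficiency of 0.5. These classical measures alone do not show whether the loss comes from load imbalance or from communication that was not overlapped with computation, which is the gap the abstract says its rules address.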
Funding: the National Key R&D Program of China (Grant Nos. 2018YFB0203903, 2016YFC0503607, and 2016YFB0200300); the National Natural Science Foundation of China (Grant Nos. 31771466 and 61702476); the Transformation Project in Scientific and Technological Achievements of Qinghai Province, China (Grant No. 2016-SF-127); the Special Project of Informatization (Grant No. XXH13504-08); the Strategic Pilot Science and Technology Project (Grant No. XDA12010000); and the 100-Talents Program (awarded to BN) of the Chinese Academy of Sciences, China.
Abstract: The accelerating growth of public microbial genomic data imposes a substantial burden on the research community that uses such resources. Building databases of non-redundant reference sequences from massive microbial genomic data, based on clustering analysis, is essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, in which clustering is accelerated by a novel parallelization strategy and a fast sequence comparison algorithm using sparse suffix arrays (SSAs). Moreover, genome identity measures between two sequences are calculated from their maximal exact matches (MEMs). We demonstrate the high speed and clustering quality of Gclust on four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.
Funding: This work is supported by the University IT Research Center Project.
Abstract: A shared-nothing spatial database cluster system provides high availability, since a replica node can continue service even if a node in the cluster crashes. However, if the failed node is not recovered quickly, overall system performance decreases, because the other nodes must process the queries that the failed node would have processed. Recovery of the cluster system is therefore very important for providing stable service. In most previously proposed techniques, external logs must be recorded on all nodes even when no node has failed, so update transactions are processed slowly. In addition, the recovery time of a failed node increases because each node uses a single storage area for the whole database to record external logs. We therefore propose a parallel recovery method for recovering a failed node quickly.
Funding: National Science Foundation of China (No. 60173031)
Abstract: This paper presents the idea of replacing traditionally expensive parallel machines with a heterogeneous cluster of workstations. To emphasise the usability of the cluster-of-workstations platform for parallel and distributed computing, the paper also reports on the effort and experience of implementing dynamic load balancing for a parallel tree computation, depth-first search (DFS), in a cluster-of-workstations project. It compares the speedup obtained on this platform with that obtained on a traditional parallel machine. The speedup results show that a cluster of workstations can be a serious alternative to expensive parallel machines.
Funding: Natural Science Foundation of China (No. 60173031)
Abstract: The real problem in a cluster of workstations is that changes in workstation power, changes in the number of workstations, and dynamic changes in the run-time behavior of the application hamper the efficient use of resources. Dynamic load balancing is a technique for the parallel implementation of problems that generate unpredictable workloads; it migrates work units from heavily loaded processors to lightly loaded processors at run time. This paper proposes an efficient load balancing method for parallel tree computations using depth-first search (DFS), which generate unpredictable, highly imbalanced workloads and move through different phases detectable at run time. A dynamic load balancing strategy is applied in each phase. The method runs under MPI (message passing interface) and the Unix operating system on a cluster-of-workstations parallel computing platform.
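The abstract does not specify the balancing policy, so the following is only a generic sketch of the idea of migrating work units from heavily loaded to lightly loaded processors. It simulates the workers as in-memory DFS stacks in one process; it does not use MPI, and the function name, the `threshold` parameter, and the choice to hand over the oldest stack entries are all assumptions made for illustration.

```python
def rebalance(stacks, threshold=2):
    """Move work units from the fullest DFS stack to the emptiest one.

    stacks: one list of unexplored nodes per worker (modified in place).
    threshold: minimum load difference that triggers a migration (>= 2).
    Returns the total number of work units migrated.
    """
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    moved = 0
    while True:
        heavy = max(stacks, key=len)
        light = min(stacks, key=len)
        gap = len(heavy) - len(light)
        if gap < threshold:
            return moved
        share = gap // 2
        # Hand over the oldest entries (bottom of the stack); in a DFS these
        # tend to be nodes near the root, which often root larger subtrees.
        light.extend(heavy[:share])
        del heavy[:share]
        moved += share
```

For example, starting from one worker holding eight nodes and three idle workers, repeated migrations leave each worker with two nodes. In a real message-passing setting, each migration would be a message between processes, so the threshold trades balance against communication cost.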
Abstract: We propose a content-based parallel image retrieval system that achieves high responsiveness. The system is built on a cluster architecture and has several retrieval servers that provide the content-based image retrieval service. It adopts the Browser/Server (B/S) mode, so users can access the system through web pages. It uses symmetrical color-spatial features (SCSF) to represent the content of an image. The SCSF is effective and efficient for image matching because it is invariant to image distortions such as rotation and flipping and it increases matching accuracy. The SCSF is organized in an M-tree, which speeds up the search procedure. Our experiments show that image matching is fast and efficient with the SCSF. With the support of several retrieval servers, the system can respond to many users at the same time.
Key words: content-based image retrieval; cluster architecture; color-spatial feature; B/S mode; task parallel; WWW; Internet
CLC number: TP391
Foundation item: Supported by the National Natural Science Foundation of China (60173058)
Biography: ZHOU Bing (1975-), male, Ph.D. candidate; research direction: data mining, content-based image retrieval.
Funding: This work is supported by the Fujian Province Natural Science Foundation (No. 2008J0180) and the Scientific Research Start Foundation of Fujian University of Technology (No. GY-Z0707).