The K-means algorithm is widely known for its simplicity and fastness in text clustering.However,the selection of the initial clus?tering center with the traditional K-means algorithm is some random,and therefore,the ...The K-means algorithm is widely known for its simplicity and fastness in text clustering.However,the selection of the initial clus?tering center with the traditional K-means algorithm is some random,and therefore,the fluctuations and instability of the clustering results are strongly affected by the initial clustering center.This paper proposed an algorithm to select the initial clustering center to eliminate the uncertainty of central point selection.The experiment results show that the improved K-means clustering algorithm is superior to the traditional algorithm.展开更多
Conceptual clustering is mainly used for solving the deficiency and incompleteness of domain knowledge. Based on conceptual clustering technology and aiming at the institutional framework and characteristic of Web the...Conceptual clustering is mainly used for solving the deficiency and incompleteness of domain knowledge. Based on conceptual clustering technology and aiming at the institutional framework and characteristic of Web theme information, this paper proposes and implements dynamic conceptual clustering algorithm and merging algorithm for Web documents, and also analyses the super performance of the clustering algorithm in efficiency and clustering accuracy. Key words conceptual clustering - clustering center - dynamic conceptual clustering - theme - web documents clustering CLC number TP 311 Foundation item: Supported by the National “863” Program of China (2002AA111010, 2003AA001032)Biography: WANG Yun-hua(1979-), male, Master candidate, research direction: knowledge engineering and data mining.展开更多
Fuzzy C-Means(FCM)is an effective and widely used clustering algorithm,but there are still some problems.considering the number of clusters must be determined manually,the local optimal solutions is easily influenced ...Fuzzy C-Means(FCM)is an effective and widely used clustering algorithm,but there are still some problems.considering the number of clusters must be determined manually,the local optimal solutions is easily influenced by the random selection of initial cluster centers,and the performance of Euclid distance in complex high-dimensional data is poor.To solve the above problems,the improved FCM clustering algorithm based on density Canopy and Manifold learning(DM-FCM)is proposed.First,a density Canopy algorithm based on improved local density is proposed to automatically deter-mine the number of clusters and initial cluster centers,which improves the self-adaptability and stability of the algorithm.Then,considering that high-dimensional data often present a nonlinear structure,the manifold learning method is applied to construct a manifold spatial structure,which preserves the global geometric properties of complex high-dimensional data and improves the clustering effect of the algorithm on complex high-dimensional datasets.Fowlkes-Mallows Index(FMI),the weighted average of homogeneity and completeness(V-measure),Adjusted Mutual Information(AMI),and Adjusted Rand Index(ARI)are used as performance measures of clustering algorithms.The experimental results show that the manifold learning method is the superior distance measure,and the algorithm improves the clustering accuracy and performs superiorly in the clustering of low-dimensional and complex high-dimensional data.展开更多
The volume of information that needs to be processed in big data clusters increases rapidly nowadays. It is critical to execute the data analysis in a time-efficient manner. However, simply adding more computation res...The volume of information that needs to be processed in big data clusters increases rapidly nowadays. It is critical to execute the data analysis in a time-efficient manner. However, simply adding more computation resources may not speed up the data analysis significantly. The data analysis jobs usually consist of multiple stages which are organized as a directed acyclic graph (DAG). The precedence relationships between stages cause scheduling challenges. General DAG scheduling is a well-known NP-hard problem. Moreover, we observe that in some parallel computing frameworks such as Spark, the execution of a stage in DAG contains multiple phases that use different resources. We notice that carefully arranging the execution of those resources in pipeline can reduce their idle time and improve the average resource utilization. Therefore, we propose a resource pipeline scheme with the objective of minimizing the job makespan. For perfectly parallel stages, we propose a contention-free scheduler with detailed theoretical analysis. Moreover, we extend the contention-free scheduler for three-phase stages, considering the computation phase of some stages can be partitioned. Additionally, we are aware that job stages in real-world applications are usually not perfectly parallel. We need to frequently adjust the parallelism levels during the DAG execution. Considering reinforcement learning (RL) techniques can adjust the scheduling policy on the fly, we investigate a scheduler based on RL for online arrival jobs. The RL-based scheduler can adjust the resource contention adaptively. We evaluate both contention-free and RL-based schedulers on a Spark cluster. In the evaluation, a real-world cluster trace dataset is used to simulate different DAG styles. Evaluation results show that our pipelined scheme can significantly improve CPU and network utilization.展开更多
文摘The K-means algorithm is widely known for its simplicity and fastness in text clustering.However,the selection of the initial clus?tering center with the traditional K-means algorithm is some random,and therefore,the fluctuations and instability of the clustering results are strongly affected by the initial clustering center.This paper proposed an algorithm to select the initial clustering center to eliminate the uncertainty of central point selection.The experiment results show that the improved K-means clustering algorithm is superior to the traditional algorithm.
文摘Conceptual clustering is mainly used for solving the deficiency and incompleteness of domain knowledge. Based on conceptual clustering technology and aiming at the institutional framework and characteristic of Web theme information, this paper proposes and implements dynamic conceptual clustering algorithm and merging algorithm for Web documents, and also analyses the super performance of the clustering algorithm in efficiency and clustering accuracy. Key words conceptual clustering - clustering center - dynamic conceptual clustering - theme - web documents clustering CLC number TP 311 Foundation item: Supported by the National “863” Program of China (2002AA111010, 2003AA001032)Biography: WANG Yun-hua(1979-), male, Master candidate, research direction: knowledge engineering and data mining.
基金The National Natural Science Foundation of China(No.62262011)the Natural Science Foundation of Guangxi(No.2021JJA170130).
文摘Fuzzy C-Means(FCM)is an effective and widely used clustering algorithm,but there are still some problems.considering the number of clusters must be determined manually,the local optimal solutions is easily influenced by the random selection of initial cluster centers,and the performance of Euclid distance in complex high-dimensional data is poor.To solve the above problems,the improved FCM clustering algorithm based on density Canopy and Manifold learning(DM-FCM)is proposed.First,a density Canopy algorithm based on improved local density is proposed to automatically deter-mine the number of clusters and initial cluster centers,which improves the self-adaptability and stability of the algorithm.Then,considering that high-dimensional data often present a nonlinear structure,the manifold learning method is applied to construct a manifold spatial structure,which preserves the global geometric properties of complex high-dimensional data and improves the clustering effect of the algorithm on complex high-dimensional datasets.Fowlkes-Mallows Index(FMI),the weighted average of homogeneity and completeness(V-measure),Adjusted Mutual Information(AMI),and Adjusted Rand Index(ARI)are used as performance measures of clustering algorithms.The experimental results show that the manifold learning method is the superior distance measure,and the algorithm improves the clustering accuracy and performs superiorly in the clustering of low-dimensional and complex high-dimensional data.
基金This work was supported in part by the National Science Foundation of U.S.A.under Grant Nos.~CNS 2128378,CNS 2107014,CNS 1824440,CNS 1828363,CNS 1757533,CNS 1629746,and CNS 1651947.
文摘The volume of information that needs to be processed in big data clusters increases rapidly nowadays. It is critical to execute the data analysis in a time-efficient manner. However, simply adding more computation resources may not speed up the data analysis significantly. The data analysis jobs usually consist of multiple stages which are organized as a directed acyclic graph (DAG). The precedence relationships between stages cause scheduling challenges. General DAG scheduling is a well-known NP-hard problem. Moreover, we observe that in some parallel computing frameworks such as Spark, the execution of a stage in DAG contains multiple phases that use different resources. We notice that carefully arranging the execution of those resources in pipeline can reduce their idle time and improve the average resource utilization. Therefore, we propose a resource pipeline scheme with the objective of minimizing the job makespan. For perfectly parallel stages, we propose a contention-free scheduler with detailed theoretical analysis. Moreover, we extend the contention-free scheduler for three-phase stages, considering the computation phase of some stages can be partitioned. Additionally, we are aware that job stages in real-world applications are usually not perfectly parallel. We need to frequently adjust the parallelism levels during the DAG execution. Considering reinforcement learning (RL) techniques can adjust the scheduling policy on the fly, we investigate a scheduler based on RL for online arrival jobs. The RL-based scheduler can adjust the resource contention adaptively. We evaluate both contention-free and RL-based schedulers on a Spark cluster. In the evaluation, a real-world cluster trace dataset is used to simulate different DAG styles. Evaluation results show that our pipelined scheme can significantly improve CPU and network utilization.