Abstract: The K-means algorithm is widely known for its simplicity and speed in text clustering. However, the traditional K-means algorithm selects its initial clustering centers more or less at random, so the clustering results fluctuate and are unstable, depending strongly on the initial centers. This paper proposes an algorithm for selecting the initial clustering centers that eliminates the uncertainty of center selection. The experimental results show that the improved K-means clustering algorithm is superior to the traditional algorithm.
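The abstract does not say how the improved algorithm picks its initial centers, so the sketch below is only an illustration of the general idea of removing randomness from seeding. It swaps in a well-known deterministic heuristic (farthest-first seeding, starting from the point nearest the data mean) and is not the paper's method. The function names `farthest_first_centers` and `kmeans` are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first_centers(points, k):
    """Deterministic seeding (assumed stand-in, not the paper's rule):
    start at the point nearest the data mean, then repeatedly add the
    point that is farthest from all centers chosen so far."""
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    centers = [min(points, key=lambda p: dist(p, mean))]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, iters=100):
    """Standard Lloyd iterations, started from the deterministic seeds."""
    centers = [list(c) for c in farthest_first_centers(points, k)]
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

Because nothing in the seeding is random, repeated runs on the same data return the same centers and labels, which is the stability property the abstract is after.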
Abstract: Although k-nearest neighbors (KNN) is a popular fingerprint matching algorithm because of its simplicity and accuracy, it is sensitive to the surrounding conditions, so a fuzzy c-means (FCM) clustering algorithm is applied to improve it. This paper therefore presents a KNN-based two-step FCM weighted (KTFW) algorithm for indoor positioning in wireless local area networks (WLAN). In the KTFW algorithm, the k reference points (RPs) chosen by KNN are clustered through FCM based on received signal strength (RSS) and location coordinates. Clusters are then selected according to a set of rules, which yields three sets of RPs (including the set of k RPs chosen by KNN), and each set is given a different weight. RPs expected to contribute more to positioning accuracy are given larger weights. Simulation results indicate that KTFW generally outperforms KNN and that providing initial clustering centers for FCM greatly reduces its complexity.
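For orientation, the KNN fingerprinting baseline that KTFW builds on can be sketched as follows. This is not the KTFW algorithm: the FCM clustering step and the cluster selection rules are not given in the abstract. The inverse-distance weighting is an assumed, commonly used scheme, and the name `knn_position` and the data layout are hypothetical.

```python
import math


def knn_position(fingerprints, rss, k=3, weighted=True, eps=1e-9):
    """Estimate (x, y) from an RSS vector using the k nearest reference points.

    fingerprints: list of (rss_vector, (x, y)) pairs, one per reference point.
    rss: the RSS vector measured online, same length as each rss_vector.
    weighted: if True, weight each RP by 1 / (RSS distance + eps)
              (an assumed weighting, not the KTFW weights).
    """
    # Rank reference points by Euclidean distance in RSS space.
    nearest = sorted(
        (math.dist(fp_rss, rss), xy) for fp_rss, xy in fingerprints
    )[:k]
    if weighted:
        weights = [1.0 / (d + eps) for d, _ in nearest]
    else:
        weights = [1.0] * len(nearest)
    total = sum(weights)
    x = sum(w * xy[0] for w, (_, xy) in zip(weights, nearest)) / total
    y = sum(w * xy[1] for w, (_, xy) in zip(weights, nearest)) / total
    return x, y
```

With `weighted=False` the estimate is the plain centroid of the k chosen RPs. The weighted variant pulls the estimate toward RPs whose fingerprints match the measurement more closely, the same intuition behind giving larger weights to more useful RPs.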
Funding: Supported by the National Natural Science Foundation of China (No. 62262011) and the Natural Science Foundation of Guangxi (No. 2021JJA170130).
Abstract: Fuzzy C-Means (FCM) is an effective and widely used clustering algorithm, but it still has some problems: the number of clusters must be determined manually, the local optimal solution is easily influenced by the random selection of initial cluster centers, and Euclidean distance performs poorly on complex high-dimensional data. To solve these problems, an improved FCM clustering algorithm based on density Canopy and manifold learning (DM-FCM) is proposed. First, a density Canopy algorithm based on improved local density is proposed to automatically determine the number of clusters and the initial cluster centers, which improves the self-adaptability and stability of the algorithm. Then, considering that high-dimensional data often present a nonlinear structure, the manifold learning method is applied to construct a manifold spatial structure, which preserves the global geometric properties of complex high-dimensional data and improves the clustering effect of the algorithm on complex high-dimensional datasets. The Fowlkes-Mallows Index (FMI), the weighted average of homogeneity and completeness (V-measure), Adjusted Mutual Information (AMI), and the Adjusted Rand Index (ARI) are used as performance measures of the clustering algorithms. The experimental results show that the manifold learning method provides the superior distance measure, and that the algorithm improves clustering accuracy and performs better on both low-dimensional and complex high-dimensional data.
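Of the four measures listed, the Adjusted Rand Index is the simplest to compute by hand. The sketch below uses the standard pair-counting definition. The function name `adjusted_rand_index` is hypothetical, and returning 1.0 for degenerate inputs is a chosen convention rather than something stated in the abstract.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table of two labelings of the same items."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # convention for fewer than two items

    # Pairs that fall together in both labelings, and in each one separately.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / total_pairs  # chance agreement
    max_index = (together_true + together_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (together_both - expected) / (max_index - expected)
```

The index is 1 when two labelings agree up to a renaming of the clusters, about 0 for random agreement, and can go negative when the labelings disagree more than chance would predict.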
Abstract: Conceptual clustering is mainly used to address deficient and incomplete domain knowledge. Based on conceptual clustering technology and targeting the organizational structure and characteristics of Web theme information, this paper proposes and implements a dynamic conceptual clustering algorithm and a merging algorithm for Web documents, and also analyses the superior performance of the clustering algorithm in efficiency and clustering accuracy.
Key words: conceptual clustering; clustering center; dynamic conceptual clustering; theme; web documents clustering
CLC number: TP 311
Foundation item: Supported by the National “863” Program of China (2002AA111010, 2003AA001032)
Biography: WANG Yun-hua (1979-), male, Master candidate, research direction: knowledge engineering and data mining.
Abstract: Intrusion detection aims to detect intrusion behavior and serves as a complement to firewalls. It can detect types of attack in malicious network communications and computer usage that conventional firewalls cannot detect. Many intrusion detection methods rely on machine learning. Previous literature has shown that an intrusion detection method based on a hybrid learning or integration approach outperforms a single learning technique. However, almost no studies focus on how to extract more representative and concise features for effective intrusion detection in massive, complicated data. In this paper, a new hybrid learning method is proposed on the basis of features such as density, cluster centers, and nearest neighbors (DCNN). In this algorithm, data is represented by the local density of each sample point and the sum of distances from each sample point to the cluster centers and to its nearest neighbor. A k-NN classifier is adopted to classify the new feature vectors. Our experiment shows that DCNN, which combines K-means, clustering-based density, and a k-NN classifier, is effective in intrusion detection.
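The feature construction described here can be sketched roughly as below. Several details are assumptions, since the abstract does not fix them: local density is taken as the number of other points within a cutoff radius, the cluster centers are supplied by the caller (for example from a prior K-means run), and the two distance terms are kept as separate components rather than added together. The name `dcnn_features` is hypothetical.

```python
import math


def dcnn_features(points, centers, cutoff):
    """Map each sample to (local density, sum of distances to centers,
    distance to nearest neighbor). Requires at least two points.

    Assumptions (not stated in the abstract): density is a neighbor count
    within `cutoff`; `centers` come from an earlier clustering step.
    """
    features = []
    for i, p in enumerate(points):
        # Distances from p to every other sample.
        others = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        density = sum(1 for d in others if d < cutoff)
        center_sum = sum(math.dist(p, c) for c in centers)
        nearest = min(others)
        features.append((density, center_sum, nearest))
    return features
```

The resulting low-dimensional vectors would then be passed to a k-NN classifier in place of the raw records.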
Abstract: This paper presents an advanced fuzzy C-means (FCM) clustering algorithm that overcomes weaknesses of the traditional FCM algorithm, including the instability caused by randomly selecting the initial centers and the limitations imposed by data separation or cluster size. The advanced FCM algorithm combines distance with density and improves the objective function to improve the algorithm's performance. The experimental results show that the proposed FCM algorithm requires fewer iterations yet provides higher accuracy than the traditional FCM algorithm. The advanced algorithm is applied to data on the box-office influence of stars, and the classification accuracy for first-class stars reaches 92.625%.
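As a point of reference, the traditional FCM iteration that this work modifies alternates a membership update and a center update. The sketch below shows that textbook baseline only. The density term and the revised objective function are not specified in the abstract and are not reproduced. The name `fcm` and its parameters are hypothetical, and the initial centers are passed in explicitly so the run is repeatable.

```python
import math


def fcm(points, centers, m=2.0, iters=100, tol=1e-6):
    """Textbook fuzzy C-means. m > 1 is the fuzzifier.

    Returns (centers, u) where u[i][j] is the membership of point i in cluster j.
    """
    centers = [list(c) for c in centers]
    dim = len(points[0])
    u = []
    for _ in range(iters):
        # Membership update: u_ij is proportional to d_ij ** (-2 / (m - 1)).
        u = []
        for p in points:
            d = [math.dist(p, c) for c in centers]
            if any(x == 0.0 for x in d):
                # Point coincides with one or more centers: split membership among them.
                hits = sum(1 for x in d if x == 0.0)
                row = [1.0 / hits if x == 0.0 else 0.0 for x in d]
            else:
                inv = [x ** (-2.0 / (m - 1.0)) for x in d]
                s = sum(inv)
                row = [v / s for v in inv]
            u.append(row)
        # Center update: mean of the points weighted by u_ij ** m.
        new_centers = []
        for j in range(len(centers)):
            w = [row[j] ** m for row in u]
            total = sum(w)
            if total == 0.0:
                new_centers.append(centers[j])  # no support: keep the old center
            else:
                new_centers.append(
                    [sum(wi * p[k] for wi, p in zip(w, points)) / total for k in range(dim)]
                )
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, u
```

Poorly placed starting centers can leave this baseline in a poor local optimum or slow its convergence, which is the instability the abstract attributes to random initialization.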
Funding: This work was supported in part by the National Science Foundation of the U.S.A. under Grant Nos. CNS 2128378, CNS 2107014, CNS 1824440, CNS 1828363, CNS 1757533, CNS 1629746, and CNS 1651947.
Abstract: The volume of information that needs to be processed in big data clusters is increasing rapidly. It is critical to execute data analysis in a time-efficient manner. However, simply adding more computation resources may not speed up the data analysis significantly. Data analysis jobs usually consist of multiple stages organized as a directed acyclic graph (DAG). The precedence relationships between stages cause scheduling challenges. General DAG scheduling is a well-known NP-hard problem. Moreover, we observe that in some parallel computing frameworks such as Spark, the execution of a stage in a DAG contains multiple phases that use different resources. We notice that carefully arranging the use of those resources in a pipeline can reduce their idle time and improve the average resource utilization. Therefore, we propose a resource pipeline scheme with the objective of minimizing the job makespan. For perfectly parallel stages, we propose a contention-free scheduler with detailed theoretical analysis. Moreover, we extend the contention-free scheduler to three-phase stages, considering that the computation phase of some stages can be partitioned. Additionally, job stages in real-world applications are usually not perfectly parallel, so the parallelism levels need to be adjusted frequently during the DAG execution. Because reinforcement learning (RL) techniques can adjust the scheduling policy on the fly, we investigate an RL-based scheduler for jobs that arrive online. The RL-based scheduler can adjust the resource contention adaptively. We evaluate both the contention-free and the RL-based schedulers on a Spark cluster. In the evaluation, a real-world cluster trace dataset is used to simulate different DAG styles. Evaluation results show that our pipelined scheme can significantly improve CPU and network utilization.
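To make the role of precedence constraints concrete, the sketch below computes the critical-path length of a stage DAG: the makespan if every stage could start as soon as its predecessors finish, with no resource contention. It is only a lower bound on the makespan, not the pipelined, contention-free, or RL-based scheduler from the abstract. The name `critical_path_length` and the input layout are hypothetical.

```python
from graphlib import TopologicalSorter


def critical_path_length(durations, deps):
    """Lower bound on job makespan for a DAG of stages.

    durations: {stage: running time}
    deps: {stage: set of stages that must finish first}; stages with no
          predecessors may be omitted.
    Returns (makespan lower bound, {stage: earliest finish time}).
    """
    graph = {stage: set(deps.get(stage, ())) for stage in durations}
    finish = {}
    # static_order() yields every stage after all of its predecessors.
    for stage in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[stage]), default=0.0)
        finish[stage] = start + durations[stage]
    return max(finish.values(), default=0.0), finish
```

For example, with stages `a` (2) and `b` (3) feeding `c` (1), which feeds `d` (4), the bound is 3 + 1 + 4 = 8. Any real schedule with limited CPU and network capacity can only match or exceed it, which is why reducing resource idle time matters.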