Funding: Supported by the National Natural Science Foundation of China under Grant No. 60303014,
Abstract: Determining an appropriate number of clusters is important when applying a specific clustering algorithm such as c-means or fuzzy c-means (FCM). In the literature, most cluster validity indices derive from partition or geometrical properties of the data set. In this paper, the authors develop a novel cluster validity index for FCM based on the optimality test of FCM. Unlike previous cluster validity indices, this index is inherent in FCM itself. Comparison experiments show that the stability index can be used as a cluster validity index for fuzzy c-means.
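The abstract does not spell out the stability index itself, so the following is only a minimal sketch of the surrounding workflow: run plain FCM for several candidate cluster counts and score each result with a validity index. The partition coefficient is used here as a stand-in index (an assumption, not the paper's index), and the names `fcm`, `partition_coefficient` and `sweep` are hypothetical.

```python
import random


def fcm(points, c, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means on a list of tuples; returns (centers, memberships)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(iters):
        # Centre update: mean of the points weighted by membership ** m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append(tuple(
                sum(w[i] * points[i][d] for i in range(n)) / total
                for d in range(dim)))
        # Membership update from squared distances (floored to avoid 1/0).
        for i in range(n):
            d2 = [max(sum((points[i][d] - ctr[d]) ** 2 for d in range(dim)), 1e-12)
                  for ctr in centers]
            inv = [(1.0 / v) ** (1.0 / (m - 1.0)) for v in d2]
            total = sum(inv)
            u[i] = [v / total for v in inv]
    return centers, u


def partition_coefficient(u):
    """Stand-in validity index: mean of squared memberships, in [1/c, 1]."""
    return sum(v * v for row in u for v in row) / len(u)


def sweep(points, c_values, m=2.0):
    """Score every candidate c and return (best_c, scores)."""
    scores = {c: partition_coefficient(fcm(points, c, m)[1]) for c in c_values}
    return max(scores, key=scores.get), scores
```

Any other validity index can be swapped in by replacing `partition_coefficient`; an index that should be minimised would use `min` instead of `max` in `sweep`.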
Abstract: Purpose - The most commonly used approaches to cluster validation are based on indices, but the majority of existing cluster validity indices do not work well on data sets of different complexities. The purpose of this paper is to propose a new cluster validity index (the ARSD index) that works well on all types of data sets. Design/methodology/approach - The authors introduce a new compactness measure that captures the typical behaviour of a cluster, where more points lie around the centre and fewer towards the outer edge. A novel penalty function is proposed for determining the distinctness measure of clusters. A random linear search algorithm is employed to evaluate and compare the performance of five commonly known validity indices and the proposed validity index. The values of the six indices are computed for all nc in the range (nc_min, nc_max) to obtain the optimal number of clusters present in a data set. The data sets used in the experiments include shaped, Gaussian-like and real data sets. Findings - In an extensive experimental study, the proposed validity index is found to be more consistent and reliable than the other validity indices in indicating the correct number of clusters. This is demonstrated experimentally on 11 data sets, where the proposed index achieved better results. Originality/value - The paper proposes a novel cluster validity index for determining the optimal number of clusters present in data sets of different complexities.
Abstract: For many clustering algorithms, it is very important to determine an appropriate number of clusters, which is known as the cluster validity problem. In this paper, a new clustering validity assessment index is proposed. It is based on a novel method that selects the margin point between two clusters more accurately for inter-cluster similarity, and it provides an improved scatter function for intra-cluster similarity. Simulation results show the effectiveness of the proposed index on the data sets under consideration, regardless of the choice of clustering algorithm.
Abstract: Unsupervised clustering and clustering validity are essential instruments of data analytics. Although clustering is carried out under uncertainty, validity indices do not deliver any quantitative evaluation of the uncertainties in the suggested partitionings. Validity measures may also be biased towards the underlying clustering method. Moreover, neglecting a confidence requirement may result in over-partitioning. In the absence of an error estimate or a confidence parameter, probable clustering errors are forwarded to the later stages of the system, whereas an uncertainty margin for the projected labeling can be very useful in many applications such as machine learning. Herein, the validity issue is approached through estimation of the uncertainty, and a novel low-complexity index is proposed for fuzzy clustering. It involves only uni-dimensional membership weights regardless of the data dimension, stipulates no specific distribution, and is independent of the underlying similarity measure. Comprehensive tests and comparisons show that it can reliably estimate the optimum number of partitions under different data distributions and is more robust to over-partitioning. In a comparative correlation analysis between true clustering error rates and several known internal validity indices, the suggested index exhibited the strongest correlations. This relationship was also shown to be stable by additional statistical acceptance tests. The provided relative uncertainty measure can therefore also be used as a probable error estimate for the clustering. In addition, it is the only known method that can specifically identify data points in doubt, and it is adjustable according to the required confidence level.
Funding: Supported by the National Natural Science Foundation of China (61139002)
Abstract: Partition-based clustering with weighted features is developed in the framework of shadowed sets. The objects in the core and boundary regions generated by shadowed-set-based clustering have different impacts on the prototype of each cluster. To integrate feature weights, a formula for weight calculation is introduced into the clustering algorithm. The selection of the weight exponent is crucial for good results, and the weights are updated iteratively with each partition of the clusters. The convergence of the weighted algorithms is shown, and cluster validity indices suitable for data mining applications are used. Experimental results on both synthetic and real-life numerical data with different feature weights demonstrate that the weighted algorithm outperforms the unweighted algorithms.
Funding: Supported by the State Grid Science and Technology Project (No. 5442AI90009) and the Natural Science Foundation of China (No. 6170337)
Abstract: With the growing number of smart meter devices, a power grid generates a large amount of data. Analyzing these data helps in understanding users' electricity consumption behavior and demands, which enables better service to be provided to them. Power load profile clustering is the basis for mining users' electricity consumption behavior. Considering the complexity, randomness, and uncertainty of this behavior, this paper proposes an ensemble clustering method to analyze it. First, principal component analysis (PCA) is used to reduce the dimensionality of the data. Subsequently, individual clustering methods are applied, and the majority result is selected for the integrated clustering. As a result, the users' electricity consumption behavior is classified into different modes, and the characteristics of each mode are analyzed in detail. The paper uses the electricity data of 19 real users in China for simulation, and it provides a thorough analysis of the users' weekly electricity consumption behavior along with suggestions. The results verify the effectiveness of the proposed method.
Abstract: Web log mining is the analysis of web log files containing web page sequences. Discovering user access patterns from web access logs is necessary for building adaptive web servers, improving e-commerce, carrying out cross-marketing, personalizing web content, and predicting web access sequences. In this paper, a new agglomerative clustering technique is proposed to identify users with similar interests and to determine their motivation for visiting a website. With this approach, web usage mining proceeds through several stages: data cleaning, preprocessing, pattern discovery and pattern analysis. Results are given to show that this approach produces tighter usage clusters than existing web usage mining techniques. Instead of traditional distance-based clustering, a similarity measure is used during the clustering process to reduce computational complexity. The paper also addresses the problem of assessing the quality of user session clusters: cluster validity is measured with a statistical test that uses the distances between cluster distributions to infer their dissimilarity and level of distinction. Using these statistical measures, cluster accuracy is shown to improve to 0.83, compared with validity measures of 0.26 for existing k-means clustering, 0.56 for FCM (fuzzy c-means) clustering and 0.54 for rough-set-based clustering. Generating dense clusters is essential for finding the interesting patterns needed for further mining and analysis.
Abstract: To address the difficulty of determining the number of clusters and the abnormal points with a clustering validity function, this paper builds an effective clustering partition model based on the genetic algorithm. A solution is encoded by combining the clustering partition with the encoded samples, and the fitness function is defined by the distances among and within clusters. The number of clusters and the samples in each cluster are determined, and the abnormal points are identified, by applying a triple random crossover operator and mutation. On known sample data, the results of the novel method and of the clustering validity function are compared. Numerical experiments show that the novel method is more effective.
Abstract: Classification systems such as Slope Mass Rating (SMR) are currently used for slope stability analysis. In the SMR classification system, data are allocated to classes based on linguistic and experience-based criteria. To eliminate the linguistic criteria resulting from experience-based judgments and to account for uncertainties in the class boundaries defined by the SMR system, the classification results were corrected using two clustering algorithms, K-means and fuzzy c-means (FCM), for the ratings obtained via continuous and discrete functions. When the clustering algorithms were applied, no prior experience-based judgment was made about the number of classes to extract; only after all steps of the clustering algorithms were completed was a new classification scheme proposed for the SMR system under different failure modes, based on the ratings obtained via continuous and discrete functions. The results of this study show that engineers can achieve more reliable and objective evaluations of slope stability by using the SMR system based on the ratings calculated via continuous and discrete functions.
Funding: Supported by the National Natural Science Foundation of China (No. 60202004).
Abstract: Among the clustering algorithms available in data mining, the CLOPE algorithm has attracted considerable attention for its high speed and good performance. However, the choice of some parameters in the CLOPE algorithm directly affects the validity of the clustering results, and how to choose them properly is still an open issue. To address this, the paper proposes a fuzzy CLOPE algorithm and presents a method for optimal parameter choice that defines a modified partition fuzzy degree as a clustering validity function. Experimental results on a real data set illustrate the effectiveness of the proposed fuzzy CLOPE algorithm and of the optimal parameter choice method based on the modified partition fuzzy degree.
Funding: National Natural Science Foundation of China (Nos. 60572065, 60772080, 60532020)
Abstract: The Gap statistic is a well-known index of clustering validity, but its implementation is difficult to understand and to determine accurately. A direct method is presented to improve the performance of the Gap statistic: the second-order difference of the within-cluster dispersion replaces the constructed null reference distribution. The Gap statistic is thereby reformulated, its implementation becomes easy, and its uncertainty in applications is reduced. The limitations of the Gap statistic are also analyzed through two typical examples, which show that it is difficult to apply to data sets containing strongly overlapping or uneven-density clusters. Experiments verify the usefulness of the proposed method.
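The idea of a second-order difference of within-cluster dispersion can be illustrated with a short sketch. The abstract does not give the exact formula, so the version below makes two assumptions: dispersion is the sum of squared distances to each cluster mean, and the selected k is the one with the largest value of W(k-1) - 2 W(k) + W(k+1). The function names are hypothetical.

```python
def within_dispersion(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        mean = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - mean[d]) ** 2 for p in members for d in range(dim))
    return total


def second_difference(w):
    """Map k -> W(k-1) - 2*W(k) + W(k+1) for every k with both neighbours."""
    return {k: w[k - 1] - 2 * w[k] + w[k + 1]
            for k in w if k - 1 in w and k + 1 in w}


def pick_k(w):
    """Return the k where the dispersion curve bends most sharply."""
    diffs = second_difference(w)
    return max(diffs, key=diffs.get)


# Illustrative dispersions for k = 1..5 (made-up numbers, not paper results).
example_w = {1: 100.0, 2: 60.0, 3: 10.0, 4: 8.0, 5: 7.0}
```

With `example_w`, the second differences are -10 at k = 2, 48 at k = 3 and 1 at k = 4, so `pick_k(example_w)` selects 3. No reference distribution has to be sampled, which is the simplification the abstract describes.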
Funding: Supported by the Open Project of Xiangjiang Laboratory (No. 22XJ02003) and the National Natural Science Foundation of China (No. 62122093).
Abstract: Time series clustering is a challenging problem because time series data are large in volume, high-dimensional, and subject to warping. Traditional clustering methods often use a single criterion or distance measure, which may not capture all the features of the data. This paper proposes a novel method for time series clustering based on evolutionary multi-tasking optimization, termed i-MFEA, which uses an improved multifactorial evolutionary algorithm to optimize multiple clustering tasks simultaneously, each with a different validity index or distance measure. i-MFEA can therefore produce diverse and robust clustering solutions that satisfy the various preferences of decision-makers. Experiments on two artificial datasets show that i-MFEA outperforms single-objective evolutionary algorithms and traditional clustering methods in terms of convergence speed and clustering quality. The paper also discusses how i-MFEA can address two long-standing issues in time series clustering: the choice of an appropriate similarity measure and the number of clusters.
Funding: Supported in part by the National Natural Science Foundation of China (No. 51877189), the National Natural Science Foundation of China Joint Program on Smart Grid (No. U2066601), and the Young Elite Scientists Sponsorship Program by China Association of Science and Technology (No. 2018QNRC001).
Abstract: With the increasingly widespread deployment of advanced metering infrastructure, electric load clustering is becoming more essential because of its great potential for analyzing consumers' energy consumption patterns and preferences through data mining. A variety of electric load clustering techniques have been put into practice to obtain the distribution of load data, observe the characteristics of load clusters, and classify the components of the total load. This can promote the development of related techniques and research in the smart grid, such as demand-side response. This paper summarizes the basic concepts and the general process of electric load clustering. Several similarity measurements and five major categories of electric load clustering are then comprehensively summarized along with their advantages and disadvantages. Afterwards, eight indices widely used to evaluate the validity of electric load clustering are described. Finally, vital applications are discussed thoroughly along with future trends, including tariff design, anomaly detection, load forecasting, data security and big data.
Funding: This work was supported by the National Natural Science Foundation of China (Grant Nos. 69872003 and 40035010)
Abstract: This paper studies the upper bound of the optimal number of clusters in clustering algorithms and proposes a new method for determining it. The method shows that the rule c_max ≤ √N, which is popular in current papers, is reasonable in some sense. This conclusion is tested and analyzed on some typical examples from the literature, which demonstrates the validity of the new method.
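In practice, an upper bound of this kind limits the range of cluster counts that a validity index has to be evaluated on. The sketch below reads the rule as the familiar square-root bound, c_max ≤ √N with N the number of samples (an assumption about the intended form), and the function name and the default lower bound of 2 are hypothetical.

```python
import math


def candidate_cluster_counts(n_samples, c_min=2):
    """Cluster counts from c_min up to floor(sqrt(n_samples)), inclusive."""
    c_max = math.isqrt(n_samples)  # integer floor of the square root
    return list(range(c_min, c_max + 1))
```

For 100 samples this yields the candidates 2 through 10; for very small data sets the list is empty, which signals that the bound leaves no admissible cluster count above `c_min`.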