With the rapid development of the economy, the scale of the power grid is expanding. The number of power equipment items that constitute the grid has become very large, which makes the state data of power equipment grow explosively. These multi-source heterogeneous data contain discrepancies that lead to data variation during transmission and storage, producing the bad information of incomplete data. Research on data integrity has therefore become an urgent task. This paper starts from the characteristics of random chance and the spatio-temporal differences of the system. According to the characteristics and sources of the massive data generated by power equipment, a fuzzy mining model of power equipment data is established, and the data are divided into numerical and non-numerical types. The text data of power equipment defects are taken as the mining material, and an array-based Apriori algorithm is then used for deep mining. The strong association rules in the incomplete data of power equipment are obtained and analyzed. Judging from the trend of the NRMSE metric and classification accuracy, most filling methods combined with the two frameworks in this method show a relatively stable filling trend and do not fluctuate greatly as the missing rate grows. The experimental results show that the proposed algorithm model can effectively improve the filling effect of existing filling methods on most data sets, and the improvement remains stable as the missing rate increases; that is, even as the missing rate grows, the improvement of the model over the existing filling methods stays above 4.3%. Through the incomplete data clustering technology studied in this paper, a more innovative state assessment of smart grid reliability operation is carried out, which has good research value and reference significance.
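A minimal sketch of the array-based Apriori mining step described above (the defect transactions and the support/confidence thresholds are invented for illustration; the paper's array optimization is only approximated here by counting candidates over a plain list of transaction sets):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Frequent-itemset mining: candidates of size k are built from
    frequent (k-1)-itemsets and counted over the transaction array."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    freq, current = {}, []
    for i in items:  # frequent 1-itemsets
        s = sum(1 for t in transactions if i in t) / n
        if s >= min_support:
            freq[frozenset([i])] = s
            current.append(frozenset([i]))
    k = 2
    while current:
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = []
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                freq[c] = s
                current.append(c)
        k += 1
    return freq

def rules(freq, min_conf):
    """Derive strong association rules (antecedent, consequent, confidence)."""
    out = []
    for itemset, sup in freq.items():
        for r in range(1, len(itemset)):
            for ante in combinations(itemset, r):
                ante = frozenset(ante)
                conf = sup / freq[ante]
                if conf >= min_conf:
                    out.append((ante, itemset - ante, conf))
    return out
```

Running this on a handful of hypothetical defect-text transactions yields rules such as "seal aging ⇒ oil leak" when the pair co-occurs often enough.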
Data clustering is a significant information retrieval technique in today's data-intensive society. Over the last few decades a vast number of data clustering algorithms have been designed and implemented for almost all data types. The quality of cluster analysis results mainly depends on the clustering algorithm used in the analysis. The architecture of a versatile, less user-dependent, dynamic and scalable data clustering machine is presented. The machine selects, for analysis, the best available data clustering algorithm on the basis of the credentials of the data and previously acquired domain knowledge. The domain knowledge is updated on completion of each session of data analysis.
Big data clustering plays an important role in the field of data processing in wireless sensor networks. However, existing approaches suffer from problems such as poor clustering effect and a low Jaccard coefficient. This paper proposes a novel big data clustering optimization method based on intuitionistic fuzzy set distance and particle swarm optimization for wireless sensor networks. The method combines principal component analysis with information-entropy-based dimensionality reduction to process big data and reduce the time required for clustering. A new distance measure for intuitionistic fuzzy sets is defined, which considers not only membership and non-membership information but also the allocation of hesitancy to membership and non-membership, thereby indirectly introducing hesitancy into the intuitionistic fuzzy set distance. The intuitionistic fuzzy kernel clustering algorithm is used to cluster big data, and particle swarm optimization is introduced to optimize the intuitionistic fuzzy kernel clustering method. The optimized algorithm is used to obtain the clustering results for wireless sensor network big data, and big data clustering is thus realized. Simulation results show that the proposed method achieves a good clustering effect compared with other state-of-the-art clustering methods.
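The allocation of hesitancy to membership and non-membership can be sketched as follows. This is one plausible reading of the distance described above, splitting the hesitancy degree pi = 1 − mu − nu in proportion to mu and nu; it is not the paper's exact definition:

```python
import math

def _allocate(mu, nu):
    """Split hesitancy pi = 1 - mu - nu between membership and
    non-membership in proportion to their relative weights
    (an assumed allocation scheme for illustration)."""
    pi = 1.0 - mu - nu
    if mu + nu > 0:
        return mu + pi * mu / (mu + nu), nu + pi * nu / (mu + nu)
    return 0.5, 0.5  # fully hesitant element: split evenly

def ifs_distance(A, B):
    """Distance between intuitionistic fuzzy sets A and B, each a list
    of (mu, nu) pairs, computed on the hesitancy-adjusted values."""
    total = 0.0
    for (mu_a, nu_a), (mu_b, nu_b) in zip(A, B):
        ma, na = _allocate(mu_a, nu_a)
        mb, nb = _allocate(mu_b, nu_b)
        total += (ma - mb) ** 2 + (na - nb) ** 2
    return math.sqrt(total / (2 * len(A)))
```

The distance is 0 for identical sets and reaches 1 for fully opposed memberships, so it can drop directly into a kernel clustering objective.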
Traditional methods tend to generate a large number of fake samples or suffer data loss when classifying unbalanced data. Therefore, this paper proposes a novel DBSCAN (density-based spatial clustering of applications with noise) variant for data clustering. The density-based DBSCAN clustering decomposition algorithm is applied to the majority classes of unbalanced data sets, which reduces the dominance of majority-class samples without data loss. The algorithm uses different distance measures for unordered and ordered categorical data, and assigns corresponding weights using average entropy. The experimental results show that the new algorithm achieves a better clustering effect than other advanced clustering algorithms on both artificial and real data sets.
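The decomposition idea above, splitting the majority class into density-based sub-clusters so that no samples are discarded, can be sketched as follows (a minimal DBSCAN on 2-D points; the data and parameters are illustrative only):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN over 2-D points; returns one label per point,
    with -1 marking noise. min_pts counts the point itself."""
    def region(i):
        return [j for j in range(len(points))
                if (points[i][0] - points[j][0]) ** 2
                 + (points[i][1] - points[j][1]) ** 2 <= eps * eps]
    labels = [None] * len(points)
    cid = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        cid += 1
        labels[i] = cid
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:     # noise point reachable from a core: border
                labels[j] = cid
            if labels[j] is not None:
                continue
            labels[j] = cid
            nb = region(j)
            if len(nb) >= min_pts:  # j is a core point: expand the cluster
                seeds.extend(nb)
    return labels

def decompose_majority(majority, eps, min_pts):
    """Split the majority class into density-based sub-classes, so each
    sub-class is closer in size to the minority class and no samples
    are discarded (unlike undersampling)."""
    groups = {}
    for p, lab in zip(majority, dbscan(majority, eps, min_pts)):
        groups.setdefault(lab, []).append(p)
    return groups
```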
The Harmony Search (HS) algorithm is highly effective in solving a wide range of real-world engineering optimization problems. However, it still suffers from problems such as being prone to local optima, low optimization accuracy, and low search efficiency. To address these limitations, a novel approach called the Dual-Memory Dynamic Search Harmony Search (DMDS-HS) algorithm is introduced. The main innovations of this algorithm are as follows. First, a dual-memory structure is introduced to rank and hierarchically organize the harmonies in the harmony memory, creating an effective and selectable trust region that reduces blind searching. Furthermore, the trust region is dynamically adjusted to improve the convergence of the algorithm while maintaining its global search capability. Second, to boost the algorithm's convergence speed, a phased dynamic convergence domain concept is introduced to strategically devise a global random search strategy. Last, the algorithm constructs an adaptive parameter adjustment strategy to tune the usage probability of its search strategies, aiming to balance the algorithm's exploration and exploitation abilities. Results on the CEC2017 benchmark test function set show that DMDS-HS outperforms nine other HS algorithms and four other state-of-the-art algorithms in terms of diversity, freedom from local optima, and solution accuracy. In addition, when DMDS-HS is applied to data clustering problems, the results show that its clustering performance exceeds that of seven classical clustering algorithms, which verifies the effectiveness and reliability of DMDS-HS in solving complex data clustering problems.
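For reference, the baseline Harmony Search loop that DMDS-HS builds on can be sketched as below: plain HS with memory consideration, pitch adjustment, and random search. All parameter values are conventional defaults, not the paper's:

```python
import random

def harmony_search(f, dim, bounds, hms=10, hmcr=0.9, par=0.3,
                   bw=0.05, iters=2000, seed=0):
    """Minimize f over [lo, hi]^dim. Each new harmony is built
    per-dimension from the memory (prob. hmcr), optionally
    pitch-adjusted (prob. par), or drawn at random."""
    rng = random.Random(seed)
    lo, hi = bounds
    memory = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(hms)]
    scores = [f(h) for h in memory]
    for _ in range(iters):
        new = []
        for d in range(dim):
            if rng.random() < hmcr:
                x = rng.choice(memory)[d]      # memory consideration
                if rng.random() < par:
                    x += rng.uniform(-bw, bw)  # pitch adjustment
                x = min(hi, max(lo, x))
            else:
                x = rng.uniform(lo, hi)        # global random search
            new.append(x)
        s = f(new)
        worst = max(range(hms), key=lambda i: scores[i])
        if s < scores[worst]:                  # replace the worst harmony
            memory[worst], scores[worst] = new, s
    best = min(range(hms), key=lambda i: scores[i])
    return memory[best], scores[best]
```

DMDS-HS replaces the single memory with a ranked dual memory and makes hmcr/par adaptive, but the skeleton of the iteration is the same.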
Purpose - The purpose of the paper is to study the multiple viewpoints that are required to access more informative similarity features among tweet documents, which is useful for achieving robust tweet data clustering results.
Design/methodology/approach - Let "N" be the number of tweet documents for topic extraction. Unwanted text, punctuation and other symbols are removed, and tokenization and stemming are performed in the initial pre-processing step. Bag-of-features are determined for the tweets; the tweets are then modelled with the obtained bag-of-features during topic extraction. Approximate topic features are extracted for every tweet document. These sets of topic features of the N documents are treated as multi-viewpoints. The key idea of the proposed work is to use multi-viewpoints in the similarity computation. The following figure illustrates the multi-viewpoint-based cosine similarity computation of five tweet documents (here N = 5); the documents are defined in the projected space with five viewpoints, say v1, v2, v3, v4, and v5. For example, the similarity between two documents (viewpoints v1 and v2) is computed with respect to the other three viewpoints (v3, v4, and v5), unlike the single viewpoint in the traditional cosine metric.
Findings - The approach is applied to healthcare problems with tweet data. Topic models play a crucial role in the classification of health-related tweets by finding topics (or health clusters) instead of computing term frequency and inverse document frequency (TF-IDF) for unlabelled tweets.
Originality/value - Topic models play a crucial role in the classification of health-related tweets by finding topics (or health clusters) instead of computing TF-IDF for unlabelled tweets.
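The multi-viewpoint similarity idea can be sketched as follows, assuming each document is a topic-feature vector and every other document serves as a viewpoint (an illustrative reading, not the paper's exact formulation):

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return num / (na * nb) if na and nb else 0.0

def mv_similarity(docs, i, j):
    """Multi-viewpoint cosine similarity of documents i and j:
    average cosine of (d_i - v) and (d_j - v) over every other
    document v used as a viewpoint, instead of the single
    origin viewpoint of the traditional cosine metric."""
    sims = []
    for h, v in enumerate(docs):
        if h in (i, j):
            continue
        a = [x - y for x, y in zip(docs[i], v)]
        b = [x - y for x, y in zip(docs[j], v)]
        sims.append(cosine(a, b))
    return sum(sims) / len(sims)
```

With N = 5 toy vectors, similar documents score near +1 and dissimilar ones near −1, giving a sharper contrast than plain cosine.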
Finding clusters based on density represents a significant class of clustering algorithms. These methods can discover clusters of various shapes and sizes. The most studied algorithm in this class is Density-Based Spatial Clustering of Applications with Noise (DBSCAN). It identifies clusters by grouping densely connected objects into one group and discarding noise objects. It requires two input parameters: epsilon (a fixed neighborhood radius) and MinPts (the lowest number of objects within epsilon). However, it cannot handle clusters of varying densities, since it uses a global value of epsilon. This article proposes an adaptation of DBSCAN that can discover clusters of varied densities while reducing the number of required input parameters to one: the only user input is MinPts. Epsilon, on the other hand, is computed automatically from statistical information of the dataset. The proposed method finds the core distance for each object in the dataset, takes the average of these distances as the first value of epsilon, and finds the clusters satisfying this density level. The remaining unclustered objects are then clustered using a new value of epsilon that equals the average core distance of the unclustered objects. This process continues until all objects have been clustered or the remaining unclustered objects amount to less than 0.006 of the dataset's size. Benchmark datasets were used to evaluate the effectiveness of the proposed method, which produced promising results. Practical experiments demonstrate the outstanding ability of the proposed method to detect clusters of different densities even when there is no separation between them. The accuracy of the method ranges from 92% to 100% on the experimented datasets.
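The first step of the epsilon computation described above can be sketched directly (illustrative only; later rounds repeat the same average over the still-unclustered objects until fewer than 0.6% of the dataset remain):

```python
import math

def initial_epsilon(points, min_pts):
    """First epsilon of the adaptive scheme: the average, over all
    objects, of the core distance -- the distance from an object to
    its min_pts-th nearest neighbour."""
    core = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        core.append(dists[min_pts])  # dists[0] is p itself (distance 0)
    return sum(core) / len(core)
```

Each subsequent density level is obtained by calling the same routine on the subset of objects the previous DBSCAN pass left unclustered.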
The data clustering problem consists in dividing a data set into prescribed groups of homogeneous data. This is an NP-hard problem that can be relaxed in spectral graph theory, where the optimal cuts of a graph are related to the eigenvalues of the graph 1-Laplacian. In this paper, we first give new notations to describe the paths, among critical eigenvectors of the graph 1-Laplacian, realizing sets with prescribed genus. We introduce pseudo-orthogonality to characterize m_3(G), a special eigenvalue of the graph 1-Laplacian. Furthermore, we use it to give an upper bound for the third graph Cheeger constant h_3(G), that is, h_3(G) ≤ m_3(G). This is a first step toward proving that the k-th Cheeger constant is the minimum of the 1-Laplacian Rayleigh quotient among vectors that are pseudo-orthogonal to the vectors realizing the previous k−1 Cheeger constants. Finally, we apply these results to give a method and a numerical algorithm to compute m_3(G), based on a generalized inverse power method.
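For context, the k-th graph Cheeger constant appearing in the bound above is commonly defined as follows (standard notation, which may differ in detail from the paper's):

```latex
h_k(G) \;=\; \min_{\substack{S_1,\dots,S_k \subseteq V \\ S_i \neq \emptyset,\; S_i \cap S_j = \emptyset}} \;\max_{1 \le i \le k} \frac{|\partial S_i|}{|S_i|},
```

where the minimum runs over families of $k$ pairwise disjoint nonempty vertex subsets and $\partial S$ denotes the set of edges with exactly one endpoint in $S$. For $k = 2$ this recovers the classical Cheeger constant, and $h_3(G)$ is the case bounded by $m_3(G)$ in the abstract.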
This paper presents a new algorithm for clustering a large amount of data. We improved the ant colony clustering algorithm, which uses ants' swarm intelligence, and tried to overcome the weaknesses of classical cluster analysis methods. In our proposed algorithm, the efficiency of agent operations is improved, and a new function, "cluster condensation", is added. The proposed algorithm reduces cluster size by uniting similar objects and incorporating them into the cluster condensation. Compared with classical cluster analysis methods, the number of steps required to complete the clustering can be reduced to 1% or less by this procedure, and the dispersion of the result can also be reduced. Moreover, our clustering algorithm has the advantage of performing cluster condensation even in a small field. In addition, the number of objects in the field decreases as clusters condense, so objects can be added to the space that has become empty. In other words, the majority of the data is first put on standby and then clustered by gradually adding parts of the standby data to the clustering data. The method can therefore handle a large amount of data. Numerical experiments confirmed that our proposed algorithm can theoretically be applied to an unrestricted volume of data.
The Circle algorithm was proposed for large datasets. The idea of the algorithm is to find a set of vertices that are close to each other and far from other vertices. The algorithm makes use of the connection between clustering aggregation and the correlation clustering problem. The best deterministic approximation algorithm was provided for a variation of the correlation clustering problem, and it was shown how sampling can be used to scale the algorithm to large datasets. An extensive empirical evaluation demonstrates the usefulness of the problem and the solutions. The results show that this method achieves more than a 50% reduction in running time without sacrificing clustering quality.
Graph-theoretical approaches have recently been widely used for data clustering and image segmentation. The goal of data clustering is to discover the underlying distribution and structural information of the given data, while image segmentation partitions an image into several non-overlapping regions. Two popular graph-theoretical clustering methods are therefore analyzed: directed-tree-based data clustering and minimum-spanning-tree-based image segmentation. There are two contributions: (1) improving directed-tree-based data clustering for image segmentation, and (2) improving minimum-spanning-tree-based image segmentation for data clustering. Extensive experiments on artificial and real-world data indicate that the improved directed-tree-based image segmentation can partition images well while preserving enough detail, and the improved minimum-spanning-tree-based data clustering can cluster data with manifold structure well.
This paper proposes a clustering technique that minimizes the need for subjective human intervention and is based on elements of rough set theory (RST). The proposed algorithm is unified in its approach to clustering and makes use of both local and global data properties to obtain clustering solutions. It handles single-type and mixed-attribute data sets with ease. Results from three data sets of single and mixed attribute types are used to illustrate the technique and establish its efficiency.
Raw data are classified using clustering techniques in a reasonable manner to create disjoint clusters. Many clustering algorithms based on specific parameters have been proposed to handle high-volume datasets. This paper focuses on cluster analysis based on neutrosophic set implication, i.e., a k-means algorithm combined with a threshold-based clustering technique. The algorithm addresses the shortcomings of the k-means clustering algorithm while overcoming the limitations of the threshold-based clustering algorithm. To evaluate the validity of the proposed method, several validity measures and validity indices are applied to the Iris dataset (from the University of California, Irvine, Machine Learning Repository) alongside the k-means and threshold-based clustering algorithms. The proposed method produces more segregated datasets with compact clusters, thus achieving higher validity indices, and eliminates the limitations of the threshold-based clustering algorithm.
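A hybrid of threshold-based seeding and k-means refinement is one plausible sketch of the combination described above (the seeding rule and all parameters are assumptions for illustration, not the paper's method):

```python
import math

def threshold_kmeans(points, k, threshold, iters=20):
    """Threshold-based seeding: a point farther than `threshold` from
    every current seed starts a new cluster, up to k seeds. The seeds
    are then refined with standard k-means (Lloyd) iterations."""
    seeds = [points[0]]
    for p in points[1:]:
        if len(seeds) == k:
            break
        if min(math.dist(p, s) for s in seeds) > threshold:
            seeds.append(p)
    for _ in range(iters):
        groups = [[] for _ in seeds]
        for p in points:
            j = min(range(len(seeds)), key=lambda i: math.dist(p, seeds[i]))
            groups[j].append(p)
        # recompute each centroid; keep the old seed if its group emptied
        seeds = [tuple(sum(c) / len(g) for c in zip(*g)) if g else s
                 for g, s in zip(groups, seeds)]
    return seeds, groups
```

The threshold controls how aggressively new clusters are spawned, which is exactly the knob plain k-means lacks.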
Fuzzy clustering theory is widely used in data mining for full-face tunnel boring machines (TBMs). However, the traditional fuzzy clustering algorithm based on an objective function struggles to cluster functional data effectively. We propose a new fuzzy clustering algorithm, the FCM-ANN algorithm, which replaces the clustering prototype of the FCM algorithm with the predicted value of an artificial neural network. This allows the algorithm not only to cluster based on the traditional similarity criterion but also to cluster functional data effectively. In this paper, we first use the t-test as an evaluation index and apply the FCM-ANN algorithm to synthetic datasets for validity testing. The algorithm is then applied to TBM operation data and combined with cross-validation to predict the tunneling speed. The predictions are evaluated by RMSE and R^2. From the experimental results on the synthetic datasets, we obtain the relationship among the membership threshold, the number of samples, the number of attributes and the noise, so the datasets can be adjusted effectively. Applying the FCM-ANN algorithm to the TBM operation data accurately predicts the tunneling speed. The FCM-ANN algorithm improves on the traditional fuzzy clustering algorithm and can be used not only for predicting TBM tunneling speed but also for clustering or prediction of other functional data.
This paper proposes a distributed dynamic k-medoid clustering algorithm for wireless sensor networks (WSNs), DDKCAWSN. Unlike node-clustering algorithms and protocols for WSNs, the algorithm focuses on clustering the data in the network. By sending clustered data to the sink instead of the raw data, the algorithm can greatly reduce the size and time of data communication, thereby saving node energy and prolonging the system lifetime. Moreover, the algorithm improves the accuracy of the clustered data dynamically by updating the clusters periodically, for example once a day. Simulation results demonstrate the effectiveness of our approach on different metrics.
Classical survival analysis assumes all subjects will experience the event of interest, but in some cases a portion of the population may never encounter the event. These survival methods further assume independent survival times, which is not valid for honey bees, which live in nests. The study introduces a semi-parametric marginal proportional hazards mixture cure (PHMC) model with an exchangeable correlation structure, using generalized estimating equations for survival data analysis. The model was tested on clustered right-censored bee survival data with a cured fraction, where two bee species were subjected to different entomopathogens to test their effect on the survival of the bee species. The Expectation-Solution algorithm is used to estimate the parameters. The study notes a weak positive association between cure statuses (ρ1 = 0.0007) and survival times for uncured bees (ρ2 = 0.0890), emphasizing their importance. The odds of being uncured are higher for A. mellifera than for M. ferruginea, and A. mellifera is more susceptible to entomopathogens icipe 7, icipe 20, and icipe 69. The Cox-Snell residuals show that the proposed semi-parametric PH model generally fits the data well compared with a model that assumes an independent correlation structure. Thus, the semi-parametric marginal proportional hazards mixture cure model is a parsimonious model for correlated bee survival data.
Based on an analysis of the features of the grid-based clustering method CLIQUE (clustering in quest) and the density-based clustering method DBSCAN (density-based spatial clustering of applications with noise), a new clustering algorithm named cooperative clustering based on grid and density (CLGRID) is presented. The new algorithm adopts an equivalence rule for region query and density unit identification. The central region of a class is computed by the grid-based method and the margin region by the density-based method. By clustering in two phases and using only a small number of seed objects in representative units to expand the cluster, the frequency of region queries can be decreased and the time cost consequently reduced. The new algorithm retains the positive features of both grid-based and density-based methods and avoids the difficulty of parameter searching. It can discover clusters of arbitrary shape with high efficiency and is not sensitive to noise. The application of CLGRID on test data sets demonstrates its validity and higher efficiency in contrast with traditional DBSCAN with an R-tree.
In conjunction with association rules for data mining, the connections between testing indices and strong and weak association rules were determined, and new derivative rules were obtained by further reasoning. Association rules were used to analyze correlation and check consistency between indices. This study shows that a judgment obtained from weak association rules or non-association rules is more accurate and credible than one obtained from strong association rules. When the testing grades of two indices in a weak association rule are inconsistent, the testing grades are more likely to be erroneous, and the mistakes are often caused by human factors. Clustering data mining technology was used to analyze the reliability of a diagnosis or to perform health diagnosis directly. Analysis showed that the clustering results are related to the indices selected: if the selected indices are more significant, the characteristics of the clustering results are also more significant, and the analysis or diagnosis is more credible. The indices and diagnosis analysis functions produced by this study provide a necessary theoretical foundation and new ideas for the development of hydraulic metal structure health diagnosis technology.
Clustering is one of the most widely used data mining techniques for creating homogeneous clusters. K-means is one of the popular clustering algorithms that, despite its inherent simplicity, also has some major problems. One way to resolve these problems and improve the k-means algorithm is to use evolutionary algorithms in clustering. In this study, the Imperialist Competitive Algorithm (ICA) is developed and then used in the clustering process. Clustering the IRIS, Wine and CMC datasets with the developed ICA, and comparing the results with clustering by the original ICA, GA and PSO algorithms, demonstrates the improvement of the Imperialist Competitive Algorithm.
Attempts to determine the characteristics of astronomical objects have been one of the major and vibrant activities in both astronomy and data science. Instead of manual inspection, various automated systems have been invented to satisfy this need, including the classification of light curve profiles. A specific Kaggle competition, the Photometric LSST Astronomical Time-Series Classification Challenge (PLAsTiCC), was launched to gather new ideas for tackling this task using the data set collected from the Large Synoptic Survey Telescope (LSST) project. Almost all proposed methods fall into the supervised family, with the common aim of categorizing each object into one of several pre-defined types. As this challenge focuses on developing a predictive model that is robust to classifying unseen data, those previous attempts similarly encounter a lack of discriminative features, since the distributions of the training and actual test datasets are largely different. As a result, well-known classification algorithms prove sub-optimal, while more complicated feature extraction techniques may help slightly boost predictive performance. Given this burden, this research explores an unsupervised alternative to this difficult quest, in which common classifiers fail to reach the 50% accuracy mark. A clustering technique is exploited to transform the space of the training data, from which a more accurate classifier can be built. In addition to a single-clustering framework that provides accuracy comparable to the front runners of supervised learning, a multiple-clustering alternative is also introduced with improved performance. In fact, it yields a higher accuracy rate of 58.32%, up from the 51.36% obtained using simple clustering. For this difficult problem, this is rather good compared with the results achieved by well-known models such as the support vector machine (SVM) at 51.80% and Naive Bayes (NB) at only 2.92%.
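The cluster-based space transformation can be sketched as follows: cluster the training data, then re-express each sample as its distances to the cluster centroids before training any classifier. This is an illustrative reading of the approach, not the competition entry itself:

```python
import math
import random

def kmeans(points, k, iters=30, seed=0):
    """Plain Lloyd k-means over tuples; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[j].append(p)
        centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids

def cluster_features(points, centroids):
    """Re-express every sample as its vector of distances to the cluster
    centroids; a classifier trained in this transformed space can be
    less sensitive to train/test distribution shift than one trained
    on raw features."""
    return [[math.dist(p, c) for c in centroids] for p in points]
```

A multiple-clustering variant would concatenate the distance features from several clusterings (different k or different seeds) before classification.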
Funding: 2021 Scientific Research Funding Project of the Liaoning Provincial Education Department (Research and implementation of a university scientific research information platform serving the transformation of achievements).
文摘Big data clustering plays an important role in the field of data processing in wireless sensor networks.However,there are some problems such as poor clustering effect and low Jaccard coefficient.This paper proposes a novel big data clustering optimization method based on intuitionistic fuzzy set distance and particle swarm optimization for wireless sensor networks.This method combines principal component analysis method and information entropy dimensionality reduction to process big data and reduce the time required for data clustering.A new distance measurement method of intuitionistic fuzzy sets is defined,which not only considers membership and non-membership information,but also considers the allocation of hesitancy to membership and non-membership,thereby indirectly introducing hesitancy into intuitionistic fuzzy set distance.The intuitionistic fuzzy kernel clustering algorithm is used to cluster big data,and particle swarm optimization is introduced to optimize the intuitionistic fuzzy kernel clustering method.The optimized algorithm is used to obtain the optimization results of wireless sensor network big data clustering,and the big data clustering is realized.Simulation results show that the proposed method has good clustering effect by comparing with other state-of-the-art clustering methods.
Abstract: Traditional methods tend to generate a large number of fake samples or suffer data loss when classifying unbalanced data. Therefore, this paper proposes a novel DBSCAN (density-based spatial clustering of applications with noise) approach for data clustering. The density-based DBSCAN clustering decomposition algorithm is applied to the majority classes of unbalanced data sets, which reduces the dominance of majority-class samples without data loss. The algorithm uses different distance measurements for unordered and ordered categorical data, and assigns corresponding weights with average entropy. The experimental results show that the new algorithm has a better clustering effect than other advanced clustering algorithms on both artificial and real data sets.
Funding: This work was supported by the Fund of the Innovative Training Program for College Students of Guangzhou University (No. s202211078116), the Guangzhou City School Joint Fund Project (No. SL2022A03J01009), the National Natural Science Foundation of China (No. 61806058), the Natural Science Foundation of Guangdong Province (No. 2018A030310063), and the Guangzhou Science and Technology Plan Project (No. 201804010299).
Abstract: The Harmony Search (HS) algorithm is highly effective in solving a wide range of real-world engineering optimization problems. However, it still suffers from problems such as being prone to local optima, low optimization accuracy, and low search efficiency. To address these limitations, a novel approach called the Dual-Memory Dynamic Search Harmony Search (DMDS-HS) algorithm is introduced. The main innovations of this algorithm are as follows. Firstly, a dual-memory structure is introduced to rank and hierarchically organize the harmonies in the harmony memory, creating an effective and selectable trust region to reduce blind searching. Furthermore, the trust region is dynamically adjusted to improve the convergence of the algorithm while maintaining its global search capability. Secondly, to boost the algorithm's convergence speed, a phased dynamic convergence domain concept is introduced to strategically devise a global random search strategy. Lastly, the algorithm constructs an adaptive parameter adjustment strategy to adjust the usage probability of the algorithm's search strategies, which aims to balance the exploration and exploitation abilities of the algorithm. Results on the CEC2017 test function set show that DMDS-HS outperforms nine other HS algorithms and four other state-of-the-art algorithms in terms of diversity, freedom from local optima, and solution accuracy. In addition, applying DMDS-HS to data clustering problems shows that its clustering performance exceeds that of seven classical clustering algorithms, which verifies the effectiveness and reliability of DMDS-HS in solving complex data clustering problems.
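For readers unfamiliar with the baseline, a plain Harmony Search loop looks like the sketch below. This is the classical algorithm that DMDS-HS extends, not the DMDS-HS variant itself: the dual-memory ranking, dynamic trust region, and adaptive parameter schedule are only summarised in the abstract, so they are not reproduced here. The parameter names (hms, hmcr, par) follow common HS convention.

```python
import random

def harmony_search(objective, bounds, hms=10, hmcr=0.9, par=0.3,
                   iters=2000, seed=1):
    """Baseline Harmony Search minimising `objective` over box `bounds`."""
    rng = random.Random(seed)
    memory = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(hms)]
    scores = [objective(h) for h in memory]
    for _ in range(iters):
        new = []
        for d, (lo, hi) in enumerate(bounds):
            if rng.random() < hmcr:
                # memory consideration: reuse a stored value for this dimension
                x = memory[rng.randrange(hms)][d]
                if rng.random() < par:
                    # pitch adjustment: small random perturbation
                    x += rng.uniform(-1, 1) * 0.01 * (hi - lo)
            else:
                # random consideration: draw a fresh value
                x = rng.uniform(lo, hi)
            new.append(min(max(x, lo), hi))
        s = objective(new)
        worst = max(range(hms), key=lambda i: scores[i])
        if s < scores[worst]:
            memory[worst], scores[worst] = new, s
    best = min(range(hms), key=lambda i: scores[i])
    return memory[best], scores[best]
```

On a simple 2-D sphere function this converges close to the origin; the abstract's criticisms (local optima, slow convergence) apply to exactly this kind of undirected memory sampling.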
Abstract: Purpose - The purpose of the paper is to study the multiple viewpoints required to access more informative similarity features among tweet documents, which is useful for achieving robust tweet data clustering results. Design/methodology/approach - Let "N" be the number of tweet documents for topic extraction. Unwanted text, punctuation and other symbols are removed, and tokenization and stemming operations are performed in the initial tweet pre-processing step. Bag-of-features are determined for the tweets; the tweets are then modelled with the obtained bag-of-features during topic extraction. Approximate topic features are extracted for every tweet document. These sets of topic features of the N documents are treated as multi-viewpoints. The key idea of the proposed work is to use multi-viewpoints in the similarity feature computation. For example, with five tweet documents (N = 5) defined in a projected space with five viewpoints v_(1), v_(2), v_(3), v_(4), and v_(5), the similarity features between two documents (viewpoints v_(1) and v_(2)) are computed with respect to the other three viewpoints (v_(3), v_(4), and v_(5)), unlike the single viewpoint in the traditional cosine metric. Findings - The approach is applied to healthcare problems with tweet data. Topic models play a crucial role in the classification of health-related tweets by finding topics (or health clusters) instead of computing term frequency and inverse document frequency (TF-IDF) for unlabelled tweets. Originality/value - Topic models enable the classification of health-related tweets by finding topics (or health clusters) rather than relying on TF-IDF for unlabelled tweets.
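The multi-viewpoint similarity described above can be sketched as follows. The abstract does not give the exact formula, so the common difference-vector variant is assumed here: the similarity of documents i and j is the cosine of their difference vectors taken from each other document h acting as a viewpoint, averaged over all such h.

```python
import math

def cosine(u, v):
    """Standard cosine similarity of two vectors (0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mv_similarity(docs, i, j):
    """Similarity of docs[i] and docs[j] averaged over every other
    document h used as a viewpoint (an assumed variant of the
    multi-viewpoint idea, not the paper's exact definition)."""
    views = [h for h in range(len(docs)) if h not in (i, j)]
    total = 0.0
    for h in views:
        u = [a - b for a, b in zip(docs[i], docs[h])]
        v = [a - b for a, b in zip(docs[j], docs[h])]
        total += cosine(u, v)
    return total / len(views)
```

Two topically close documents look similar from every external viewpoint, while a single-origin cosine can be misled by the choice of origin; this is the extra information the multi-viewpoint formulation captures.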
Funding: The author extends his appreciation to the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia for funding this research work through project number (IFPSAU-2021/01/17758).
Abstract: Finding clusters based on density represents a significant class of clustering algorithms. These methods can discover clusters of various shapes and sizes. The most studied algorithm in this class is the Density-Based Spatial Clustering of Applications with Noise (DBSCAN). It identifies clusters by grouping densely connected objects into one group and discarding noise objects. It requires two input parameters: epsilon (a fixed neighborhood radius) and MinPts (the lowest number of objects within epsilon). However, it cannot handle clusters of various densities, since it uses a global value for epsilon. This article proposes an adaptation of DBSCAN so that it can discover clusters of varied densities while reducing the required number of input parameters to one. The only user input in the proposed method is MinPts; epsilon, on the other hand, is computed automatically from statistical information of the dataset. The proposed method finds the core distance for each object in the dataset, takes the average of these distances as the first value of epsilon, and finds the clusters satisfying this density level. The remaining unclustered objects are then clustered using a new value of epsilon that equals the average core distance of the unclustered objects. This process continues until all objects have been clustered or the remaining unclustered objects amount to less than 0.006 of the dataset's size. Benchmark datasets were used to evaluate the effectiveness of the proposed method, which produced promising results. Practical experiments demonstrate the outstanding ability of the proposed method to detect clusters of different densities even when there is no separation between them. The accuracy of the method ranges from 92% to 100% on the tested datasets.
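The epsilon schedule at the heart of this adaptation can be sketched as follows. The DBSCAN run at each density level is omitted; as a simplifying assumption, objects whose core distance is at most the current epsilon are treated as "clustered" at that level, which is enough to show how successive epsilon values grow from dense regions toward sparse ones.

```python
import math

def core_distance(points, idx, min_pts):
    """Distance from points[idx] to its min_pts-th nearest neighbour."""
    p = points[idx]
    dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != idx)
    return dists[min_pts - 1]

def adaptive_epsilons(points, min_pts, min_fraction=0.006):
    """Successive epsilon values: each is the mean core distance of the
    still-unclustered objects, mirroring the scheme described above."""
    remaining = list(range(len(points)))
    eps_values = []
    while len(remaining) > max(1, min_fraction * len(points)):
        cds = {i: core_distance(points, i, min_pts) for i in remaining}
        eps = sum(cds.values()) / len(cds)
        eps_values.append(eps)
        still = [i for i in remaining if cds[i] > eps]
        if len(still) == len(remaining):  # no progress; stop
            break
        remaining = still
    return eps_values
```

On data with one dense cluster and a few sparse outliers, the first epsilon is dominated by the dense region and the later ones grow, which is exactly how the method reaches the low-density clusters without a user-supplied radius.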
基金supported by the MiUR-Dipartimenti di Eccellenza 2018–2022 grant“Sistemi distribuiti intelligenti”of Dipartimento di Ingegneria Elettrica e dell’Informazione“M.Scarano”,by the MiSE-FSC 2014–2020 grant“SUMMa:Smart Urban Mobility Management”,and by GNAMPA of INdAM.The authors would also like to thank D.A.La Manna and V.Mottola for the helpful conversations during the starting stage of this work.
Abstract: The data clustering problem consists in dividing a data set into prescribed groups of homogeneous data. This is an NP-hard problem that can be relaxed in spectral graph theory, where the optimal cuts of a graph are related to the eigenvalues of the graph 1-Laplacian. In this paper, we first give new notations to describe the paths, among critical eigenvectors of the graph 1-Laplacian, realizing sets with prescribed genus. We introduce pseudo-orthogonality to characterize m_(3)(G), a special eigenvalue of the graph 1-Laplacian. Furthermore, we use it to give an upper bound for the third graph Cheeger constant h_(3)(G), that is, h_(3)(G) ≤ m_(3)(G). This is a first step toward proving that the k-th Cheeger constant is the minimum of the 1-Laplacian Rayleigh quotient among vectors that are pseudo-orthogonal to the vectors realizing the previous k-1 Cheeger constants. Finally, we apply these results to give a method and a numerical algorithm to compute m_(3)(G), based on a generalized inverse power method.
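For context, the Rayleigh quotient of the graph 1-Laplacian that underlies these Cheeger-type bounds is commonly written as follows; this is the standard formulation from the graph 1-Laplacian literature, not a formula quoted from the paper:

```latex
R(x) = \frac{\sum_{(i,j)\in E} |x_i - x_j|}{\sum_{i\in V} |x_i|}, \qquad x \in \mathbb{R}^{|V|}\setminus\{0\}
```

The Cheeger constants then arise as min-max critical values of R over suitable families of subsets, which is why bounding h_(3)(G) reduces to an eigenvalue problem for the 1-Laplacian.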
Funding: Project (No. 18510132) supported by the Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research.
Abstract: This paper presents a new algorithm for clustering a large amount of data. We improved the ant colony clustering algorithm, which uses an ant's swarm intelligence, and tried to overcome the weaknesses of classical cluster analysis methods. In our proposed algorithm, the efficiency of agent operations is improved, and a new function, "cluster condensation", is added: cluster size is reduced by uniting similar objects and incorporating them into the condensed cluster. Compared with classical cluster analysis methods, the number of steps required to complete the clustering can be suppressed to 1% or less by this procedure, and the dispersion of the results can also be reduced. Moreover, cluster condensation allows our algorithm to operate even in a small field. In addition, the number of objects in the field decreases as clusters condense, so objects can be added to the space that has become empty. In other words, the majority of the data is first put on standby, and parts of the standby data are then gradually added to the clustering data. The method can therefore handle a large amount of data. Numerical experiments confirmed that our proposed algorithm can in principle be applied to an unrestricted volume of data.
Funding: Projects (60873265, 60903222) supported by the National Natural Science Foundation of China; Project (IRT0661) supported by the Program for Changjiang Scholars and Innovative Research Team in University of China.
Abstract: The Circle algorithm was proposed for large datasets. The idea of the algorithm is to find a set of vertices that are close to each other and far from other vertices. The algorithm makes use of the connection between clustering aggregation and the correlation clustering problem. The best deterministic approximation algorithm is provided for the variation of the correlation clustering problem, and it is shown how sampling can be used to scale the algorithm to large datasets. An extensive empirical evaluation demonstrates the usefulness of the problem and the solutions. The results show that this method achieves more than a 50% reduction in running time without sacrificing clustering quality.
Funding: Supported by the Key National Natural Science Foundation of China (61035003).
Abstract: Graph-theoretical approaches have recently been widely used for data clustering and image segmentation. The goal of data clustering is to discover the underlying distribution and structural information of the given data, while image segmentation is to partition an image into several non-overlapping regions. Therefore, two popular graph-theoretical clustering methods are analyzed: directed-tree-based data clustering and minimum-spanning-tree-based image segmentation. There are two contributions: (1) improving the directed-tree-based data clustering for image segmentation, and (2) improving the minimum-spanning-tree-based image segmentation for data clustering. Extensive experiments using artificial and real-world data indicate that the improved directed-tree-based image segmentation can partition images well while preserving enough detail, and the improved minimum-spanning-tree-based data clustering can cluster data with manifold structure well.
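The minimum-spanning-tree family of methods referenced here follows a simple recipe, sketched below as a generic illustration (not the paper's improved variant): build the MST of the data, then delete the k-1 longest edges so the remaining forest splits into k clusters.

```python
import math

def mst_clusters(points, k):
    """Cluster points into k groups by cutting the k-1 longest MST edges."""
    n = len(points)
    # Prim's algorithm: grow the MST from point 0
    in_tree = {0}
    edges = []
    best = {i: (math.dist(points[0], points[i]), 0) for i in range(1, n)}
    while len(in_tree) < n:
        i = min(best, key=lambda j: best[j][0])
        d, parent = best.pop(i)
        edges.append((d, parent, i))
        in_tree.add(i)
        for j in best:
            dj = math.dist(points[i], points[j])
            if dj < best[j][0]:
                best[j] = (dj, i)
    # drop the k-1 longest edges, then collect connected components
    edges.sort()
    adj = {i: [] for i in range(n)}
    for _, a, b in edges[: n - k]:
        adj[a].append(b)
        adj[b].append(a)
    labels, cur = [-1] * n, 0
    for s in range(n):
        if labels[s] == -1:
            stack = [s]
            while stack:
                v = stack.pop()
                if labels[v] == -1:
                    labels[v] = cur
                    stack.extend(adj[v])
            cur += 1
    return labels
```

Because the MST preserves connectivity along dense paths, cutting its longest edges separates well-spaced groups even when they are elongated, which is why this family handles manifold-structured data better than centroid methods.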
Abstract: This paper proposes a clustering technique that minimizes the need for subjective human intervention and is based on elements of rough set theory (RST). The proposed algorithm is unified in its approach to clustering and makes use of both local and global data properties to obtain clustering solutions. It handles single-type and mixed-attribute data sets with ease. The results from three data sets of single and mixed attribute types are used to illustrate the technique and establish its efficiency.
Abstract: Raw data are grouped using clustering techniques in a reasonable manner to create disjoint clusters. Many clustering algorithms based on specific parameters have been proposed to handle high-volume datasets. This paper focuses on cluster analysis based on neutrosophic set implication, i.e., a k-means algorithm with a threshold-based clustering technique. This algorithm addresses the shortcomings of the k-means clustering algorithm by overcoming the limitations of the threshold-based clustering algorithm. To evaluate the validity of the proposed method, several validity measures and validity indices are applied to the Iris dataset (from the University of California, Irvine, Machine Learning Repository), along with the k-means and threshold-based clustering algorithms. The proposed method results in more segregated datasets with compact clusters, thus achieving higher validity indices, and it eliminates the limitations of the threshold-based clustering algorithm.
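The threshold-based clustering that this abstract builds on can be sketched as a single pass over the data: assign each point to the first existing cluster whose centroid lies within a distance threshold, otherwise open a new cluster. This is the generic formulation only; the paper's neutrosophic refinement and the k-means combination are not reproduced.

```python
import math

def threshold_cluster(points, threshold, dist=math.dist):
    """One-pass threshold clustering: join the first centroid within
    `threshold`, else start a new cluster."""
    centroids, members = [], []
    for p in points:
        placed = False
        for c_idx, c in enumerate(centroids):
            if dist(p, c) <= threshold:
                members[c_idx].append(p)
                # update the centroid as the running mean of its members
                m = members[c_idx]
                centroids[c_idx] = tuple(sum(x) / len(m) for x in zip(*m))
                placed = True
                break
        if not placed:
            centroids.append(p)
            members.append([p])
    return members
```

The known limitation the paper targets is visible here: the result depends on the input order and on the fixed threshold, which is what pairing it with k-means-style refinement is meant to correct.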
基金supported by the National Key R&D Program of China(Grant Nos.2018YFB1700704 and 2018YFB1702502)the Study on the Key Management and Privacy Preservation in VANET,The Innovation Foundation of Science and Technology of Dalian(2018J12GX045).
Abstract: Fuzzy clustering theory is widely used in data mining for full-face tunnel boring machines. However, the traditional fuzzy clustering algorithm based on an objective function has difficulty effectively clustering functional data. We propose a new fuzzy clustering algorithm, namely the FCM-ANN algorithm, which replaces the clustering prototype of the FCM algorithm with the predicted value of an artificial neural network. This makes the algorithm not only satisfy clustering based on the traditional similarity criterion, but also effectively cluster functional data. In this paper, we first use the t-test as an evaluation index and apply the FCM-ANN algorithm to synthetic datasets for validity testing. Then the algorithm is applied to TBM operation data and combined with the cross-validation method to predict the tunneling speed. The predicted results are evaluated by RMSE and R^(2). According to the experimental results on the synthetic datasets, we obtain the relationship among the membership threshold, the number of samples, the number of attributes and the noise, so the datasets can be effectively adjusted. Applying the FCM-ANN algorithm to the TBM operation data accurately predicts the tunneling speed. The FCM-ANN algorithm improves on the traditional fuzzy clustering algorithm and can be used not only for predicting the tunneling speed of a TBM but also for clustering or prediction of other functional data.
基金the National Natural Science Foundation of China (60472047)
Abstract: This paper proposes a distributed dynamic k-medoid clustering algorithm for wireless sensor networks (WSNs), DDKCAWSN. Different from node-clustering algorithms and protocols for WSNs, the algorithm focuses on clustering the data in the network. By sending the sink clustered data instead of raw data, the algorithm can greatly reduce the size and time of data communication, thereby saving the energy of the nodes and prolonging the system lifetime. Moreover, the algorithm improves the accuracy of the clustered data dynamically by updating the clusters periodically, for example each day. Simulation results demonstrate the effectiveness of our approach on different metrics.
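The k-medoid clustering that the nodes are described as performing can be sketched with a plain alternating loop: assign each point to its nearest medoid, then re-pick each medoid as the member minimising total within-cluster distance. The distributed messaging, periodic updates, and energy accounting of DDKCAWSN are outside this sketch, and the deterministic initialisation below is an assumption for reproducibility.

```python
import math

def k_medoids(points, k, max_iter=20):
    """Plain k-medoids: returns (medoid indices, {medoid: member indices})."""
    medoids = list(range(k))  # deterministic init: first k points
    clusters = {}
    for _ in range(max_iter):
        # assign each point to its nearest medoid
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda md: math.dist(p, points[md]))
            clusters[nearest].append(i)
        # re-pick each medoid as the member minimising total distance
        new_medoids = []
        for m, idxs in clusters.items():
            best = min(idxs, key=lambda i: sum(
                math.dist(points[i], points[j]) for j in idxs))
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return medoids, clusters
```

Because a medoid is an actual data point, a sensor node can transmit the medoids as compact representatives of its readings, which is the communication saving the abstract describes.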
Abstract: Classical survival analysis assumes all subjects will experience the event of interest, but in some cases a portion of the population may never encounter it. These survival methods further assume independent survival times, which is not valid for honey bees, which live in nests. This study introduces a semi-parametric marginal proportional hazards mixture cure (PHMC) model with an exchangeable correlation structure, using generalized estimating equations for survival data analysis. The model was tested on clustered right-censored bee survival data with a cured fraction, where two bee species were subjected to different entomopathogens to test the effect of the entomopathogens on the survival of the species. The Expectation-Solution algorithm is used to estimate the parameters. The study notes a weak positive association between cure statuses (ρ1 = 0.0007) and between survival times for uncured bees (ρ2 = 0.0890), emphasizing their importance. The odds of being uncured are higher for A. mellifera than for M. ferruginea, and A. mellifera is more susceptible to the entomopathogens icipe 7, icipe 20, and icipe 69. The Cox-Snell residuals show that the proposed semi-parametric PH model generally fits the data well compared with the model that assumes an independent correlation structure. Thus, the semi-parametric marginal proportional hazards mixture cure model is a parsimonious model for correlated bee survival data.
Funding: This project is supported by the National Natural Science Foundation of China (No. 50575153).
Abstract: Based on an analysis of the features of the grid-based clustering method CLIQUE (clustering in quest) and the density-based clustering method DBSCAN (density-based spatial clustering of applications with noise), a new clustering algorithm named cooperative clustering based on grid and density (CLGRID) is presented. The new algorithm adopts an equivalent rule of regional inquiry and density unit identification. The central region of a class is computed by the grid-based method and the margin region by the density-based method. By clustering in two phases and using only a small number of seed objects in representative units to expand a cluster, the frequency of region queries can be decreased, and consequently the time cost is reduced. The new algorithm retains the positive features of both grid-based and density-based methods and avoids the difficulty of parameter searching. It can discover clusters of arbitrary shape with high efficiency and is not sensitive to noise. The application of CLGRID to test data sets demonstrates its validity and higher efficiency, in contrast with traditional DBSCAN with an R-tree.
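The grid phase of this kind of hybrid can be sketched in a few lines: hash each point into a cell and keep the cells whose population reaches a density threshold; those dense cells form the central regions, with a density-based pass (not shown) refining the cluster margins. Cell size and count threshold below are illustrative parameters, not values from the paper.

```python
from collections import Counter

def dense_cells(points, cell_size, min_count):
    """Return the set of 2-D grid cells containing at least min_count points."""
    counts = Counter((int(x // cell_size), int(y // cell_size))
                     for x, y in points)
    return {cell for cell, c in counts.items() if c >= min_count}
```

Because cell membership is a constant-time hash, this step replaces many expensive region queries with a single pass over the data, which is where the reported time saving over plain DBSCAN comes from.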
基金supported by the Key Program of the National Natural Science Foundation of China(Grant No.50539010)the Special Fund for Public Welfare Industry of the Ministry of Water Resources of China(Grant No.200801019)
Abstract: In conjunction with association rules for data mining, the connections between testing indices and strong and weak association rules were determined, and new derivative rules were obtained by further reasoning. Association rules were used to analyze correlation and check consistency between indices. This study shows that a judgment obtained from weak association rules or non-association rules is more accurate and credible than one obtained from strong association rules. When the testing grades of two indices in a weak association rule are inconsistent, the testing grades are more likely to be erroneous, and the mistakes are often caused by human factors. Clustering data mining technology was used to analyze the reliability of a diagnosis, or to perform health diagnosis directly. The analysis showed that the clustering results depend on the indices selected: if the indices selected are more significant, the characteristics of the clustering results are also more significant, and the analysis or diagnosis is more credible. The indices and the diagnosis analysis function produced by this study provide a necessary theoretical foundation and new ideas for the development of hydraulic metal structure health diagnosis technology.
Abstract: Clustering is one of the most widely used data mining techniques and can be used to create homogeneous clusters. K-means is one of the popular clustering algorithms that, despite its inherent simplicity, has some major problems. One way to resolve these problems and improve the k-means algorithm is the use of evolutionary algorithms in clustering. In this study, the Imperialist Competitive Algorithm (ICA) is developed and then used in the clustering process. Clustering the IRIS, Wine and CMC datasets using the developed ICA and comparing the results with clustering by the original ICA, GA and PSO algorithms demonstrates the improvement of the Imperialist Competitive Algorithm.
基金funded by the Security BigData Fusion Project(Office of theMinistry of Higher Education,Science,Research and Innovation).The corresponding author is the project PI.
Abstract: Attempts to determine the characteristics of astronomical objects have been one of the major and vibrant activities in both astronomy and data science. Instead of manual inspection, various automated systems have been invented to satisfy this need, including the classification of light curve profiles. A specific Kaggle competition, namely the Photometric LSST Astronomical Time-Series Classification Challenge (PLAsTiCC), was launched to gather new ideas for tackling this task using the data set collected from the Large Synoptic Survey Telescope (LSST) project. Almost all proposed methods fall into the supervised family, with the common aim of categorizing each object into one of the pre-defined types. As this challenge focuses on developing a predictive model that is robust to classifying unseen data, those previous attempts similarly encounter a lack of discriminative features, since the distributions of the training and actual test datasets are largely different. As a result, well-known classification algorithms prove to be sub-optimal, while more complicated feature extraction techniques may help to slightly boost the predictive performance. Given this burden, this research explores an unsupervised alternative to this difficult quest, where common classifiers fail to reach the 50% accuracy mark. A clustering technique is exploited to transform the space of the training data, from which a more accurate classifier can be built. In addition to a single-clustering framework that provides an accuracy comparable to the front runners of supervised learning, a multiple-clustering alternative is also introduced with improved performance. In fact, it yields a higher accuracy rate of 58.32%, up from the 51.36% obtained using simple clustering. For this difficult problem, this is rather good compared with well-known models such as the support vector machine (SVM) at 51.80% and Naive Bayes (NB) at only 2.92%.
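The space-transformation step described here can be sketched generically: each sample is re-represented by its distances to a set of cluster centres, and a classifier is then built in that transformed space. The centres are assumed to come from a clustering run on the training data (e.g. k-means); the nearest-centre rule below is a minimal stand-in for the actual classifier, not the paper's model.

```python
import math

def cluster_features(samples, centers):
    """Re-represent each sample as its vector of distances to the centres,
    the space-transformation step described above."""
    return [[math.dist(x, c) for c in centers] for x in samples]

def nearest_center_label(x, centers, center_labels):
    """Classify by the label attached to the nearest cluster centre."""
    best = min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
    return center_labels[best]
```

With multiple clusterings, several centre sets are concatenated into one longer distance vector, giving the classifier more views of the same object; that is the intuition behind the reported jump from 51.36% to 58.32%.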