Abstract: The dimensionality of data is increasing very rapidly, which creates challenges for most current mining and learning algorithms, such as large memory requirements and high computational costs. The literature includes much research on feature selection for supervised learning. However, feature selection for unsupervised learning has only recently been studied. Finding the subset of features that enhances performance in unsupervised learning is challenging since the clusters are indeterminate. This work proposes a hybrid technique for unsupervised feature selection called GAk-MEANS, which combines the genetic algorithm (GA) approach with the classical k-Means algorithm. In the proposed algorithm, a new fitness function is designed in addition to new smart crossover and mutation operators. The effectiveness of this algorithm is demonstrated on various datasets. Furthermore, the performance of GAk-MEANS is compared with other genetic algorithms, such as the genetic algorithm using the Sammon Error Function and the genetic algorithm using the Sum of Squared Error Function. Additionally, the performance of GAk-MEANS is compared with state-of-the-art statistical unsupervised feature selection techniques. Experimental results show that GAk-MEANS consistently selects subsets of features that result in better classification accuracy than the other methods. In particular, GAk-MEANS reduces the size of the subset of selected features by an average of 86.35% (72%–96.14%), which leads to an average increase in accuracy of 3.78% (1.05%–6.32%) compared to using all features. Compared with the genetic algorithm using the Sammon Error Function, GAk-MEANS reduces the size of the subset of selected features by 41.29% on average, improves the accuracy by 5.37%, and reduces the time by 70.71%. Compared with the genetic algorithm using the Sum of Squared Error Function, GAk-MEANS on average reduces the size of the subset of selected features by 15.91% and improves the accuracy by 9.81%, but the time increases by a factor of 3. Compared with the machine-learning based methods, GAk-MEANS increases the accuracy by 13.67% on average, with an 88.76% average increase in time.
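The abstract above does not specify the new GAk-MEANS fitness function, so the following is only a minimal sketch of the Sum of Squared Error criterion that it names as a baseline. It assumes a chromosome is a 0/1 mask over features and that cluster labels have already been produced (for example by k-Means); the function name `subset_sse` and its parameters are hypothetical.

```python
from typing import Dict, List, Sequence


def subset_sse(data: Sequence[Sequence[float]],
               labels: Sequence[int],
               mask: Sequence[int]) -> float:
    """Within-cluster sum of squared errors, using only features with mask[j] == 1.

    Assumption: `mask` plays the role of a GA chromosome (one bit per feature).
    """
    selected = [j for j, bit in enumerate(mask) if bit]
    if not selected:
        raise ValueError("mask selects no features")

    # Group row indices by their cluster label.
    clusters: Dict[int, List[int]] = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    sse = 0.0
    for members in clusters.values():
        for j in selected:
            # Centroid coordinate of this cluster along feature j.
            mean = sum(data[i][j] for i in members) / len(members)
            sse += sum((data[i][j] - mean) ** 2 for i in members)
    return sse
```

Note that raw SSE can only shrink or stay equal when a feature is dropped, because each feature contributes a non-negative term. A GA fitness built on it would therefore presumably need some normalisation or a penalty on subset size; how GAk-MEANS handles this is not stated in the abstract.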
Funding: Supported by the University Doctorate Special Research Fund (No. 20030614001) and the Youth Scholarship Leader Fund of Univ. of Electro. Sci. and Tech. of China.
Abstract: In this letter, a new method is proposed for unsupervised classification of terrain types and man-made objects using POLarimetric Synthetic Aperture Radar (POLSAR) data. The technique combines the polarimetric information of SAR images with an unsupervised classification method based on fuzzy set theory. Image quantization and image enhancement are used to preprocess the POLSAR data. The polarimetric information and the Fuzzy C-Means (FCM) clustering algorithm are then used to classify the preprocessed images. The advantages of this algorithm are automated classification, high classification accuracy, fast convergence and high stability. The effectiveness of the algorithm is demonstrated by experiments using SIR-C/X-SAR (Spaceborne Imaging Radar-C/X-band Synthetic Aperture Radar) data.
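Several of the abstracts in this listing rely on Fuzzy C-Means. As a reminder of what the algorithm does, here is a minimal pure-Python sketch of the textbook FCM iteration, alternating the membership update and the weighted-centroid update. It is not the POLSAR pipeline described above. The function name, the `init_centers` option and the defaults (fuzzifier `m=2.0`, tolerance `1e-6`) are assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple


def fuzzy_c_means(points: Sequence[Sequence[float]], c: int, m: float = 2.0,
                  iters: int = 100, tol: float = 1e-6,
                  init_centers: Optional[Sequence[Sequence[float]]] = None,
                  seed: int = 0) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (centers, memberships); memberships[i][k] is the degree of point i in cluster k."""
    if m <= 1.0:
        raise ValueError("fuzzifier m must be greater than 1")
    rng = random.Random(seed)
    # Assumption: start from given centers, else from c randomly sampled data points.
    centers = [list(p) for p in (init_centers or rng.sample(list(points), c))]
    n, dim = len(points), len(points[0])
    u = [[0.0] * c for _ in range(n)]

    for _ in range(iters):
        shift = 0.0
        # Membership update: u_ik is proportional to d_ik^(-2/(m-1)), normalised per point.
        for i, x in enumerate(points):
            d2 = [sum((x[d] - ctr[d]) ** 2 for d in range(dim)) for ctr in centers]
            zero = [k for k, v in enumerate(d2) if v == 0.0]
            if zero:
                # Point sits exactly on one or more centers: split membership among them.
                new = [1.0 / len(zero) if k in zero else 0.0 for k in range(c)]
            else:
                inv = [v ** (-1.0 / (m - 1.0)) for v in d2]
                total = sum(inv)
                new = [v / total for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new, u[i])))
            u[i] = new
        # Center update: mean of the points weighted by u_ik^m.
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            if total > 0.0:
                centers[k] = [sum(w[i] * points[i][d] for i in range(n)) / total
                              for d in range(dim)]
        if shift < tol:  # memberships have stopped changing
            break
    return centers, u
```

Unlike k-Means, each point keeps a graded membership in every cluster, and each row of the membership matrix sums to 1. A hard label can be read off as the index of the largest entry in a row.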
Abstract: This paper presents a fuzzy logic approach to efficiently perform unsupervised character classification, improving the robustness, correctness and speed of a character recognition system. The characters are first split into eight typographical categories. The classification scheme uses pattern matching to classify the characters in each category into a set of fuzzy prototypes based on a nonlinear weighted similarity function. The fuzzy unsupervised character classification, which is natural in the repre...
Funding: Higher Educational Science and Technology Program of Shandong Province (No. J17KA171); Natural Science Foundation of Shandong Province (No. ZR2020MA029).
Abstract: As a classic NP-hard problem in machine learning and computational geometry, the k-means problem aims to partition a given dataset into k clusters so as to minimize the squared Euclidean distance. Unlike the k-means problem and most of its variants, the fuzzy k-means problem is a soft clustering problem, in which every data point has a relationship to every center point. Compared to the fuzzy k-means problem, the fuzzy k-means problem with penalties allows some data points to be left unclustered at the cost of paying a penalty. In this paper, we propose an O(αk ln k)-approximation algorithm based on a seeding algorithm for the fuzzy k-means problem with penalties, where α involves the ratio of the maximal penalty value to the minimal one. Furthermore, we carry out numerical experiments to show the effectiveness of our algorithm.
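The abstract above does not describe its seeding algorithm. For orientation, the sketch below shows the classical k-means++ seeding for plain (hard, penalty-free) k-means, where each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far. It is not the penalty-aware fuzzy variant the paper proposes, and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points: Sequence[Sequence[float]], k: int,
                   seed: int = 0) -> List[Sequence[float]]:
    """Pick k initial centers from `points` by D^2-weighted sampling (k-means++)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)

    # First center: uniform at random.
    centers = [points[rng.randrange(len(points))]]
    # d2[i] = squared distance from point i to its nearest chosen center.
    d2 = [sqdist(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) > 0.0:
            # Sample the next center with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        else:
            # Every point coincides with a chosen center: fall back to uniform.
            idx = rng.randrange(len(points))
        centers.append(points[idx])
        d2 = [min(old, sqdist(p, points[idx])) for old, p in zip(d2, points)]
    return centers
```

Points that already coincide with a chosen center have weight zero, so while any other point remains the seeds are pushed toward regions that are still uncovered.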
Abstract: With the sharp increase in information volume, analyzing and retrieving this vast amount of data is more essential than ever. One of the main techniques for this purpose is clustering. Clustering aims to group objects so that all objects within a cluster have similar features while objects in different clusters are as distinct as possible. One of the most widely used clustering algorithms, with well-established performance in different applications, is the k-means algorithm. The main problem of the k-means algorithm is that its performance is directly affected by the selection of the initial clusters. Neglecting this crucial issue has consequences such as creating empty clusters and slowing convergence. Moreover, selecting appropriate initial seeds can reduce the clusters' inconsistency. In this paper, we present a new method for determining the initial seeds of the k-means algorithm in order to improve its accuracy and decrease its number of iterations. The method uses the average distance between objects to determine the initial seeds, and attempts to provide a proper tradeoff between the accuracy and speed of the clustering algorithm. The experimental results show that the proposed approach outperforms the Chithra approach by 1.7% and 2.1% in clustering accuracy on the Wine and Abalone detection data, respectively. Furthermore, the results indicate that, compared with the Reverse Nearest Neighbor (RNN) search approach, the proposed method converges faster.
Abstract: Classifying data into meaningful groups is one of the fundamental ways of understanding and learning valuable information. High-quality clustering methods are necessary for the valuable and efficient analysis of ever-growing data. The Firefly Algorithm (FA) is a bio-inspired algorithm that has recently been used to solve clustering problems. In this paper, a Hybrid F-Firefly algorithm is developed by combining Fuzzy C-Means (FCM) with FA to improve clustering accuracy with a global optimum solution. The Hybrid F-Firefly algorithm incorporates the FCM operator at the end of each iteration of the FA algorithm. The proposed algorithm is designed to exploit the strengths of the existing algorithms and to enhance the original FA algorithm while overcoming shortcomings of the FCM algorithm, such as trapping in local optima and sensitivity to initial seed points. In this research work, the Hybrid F-Firefly algorithm is implemented and experimentally tested on various performance measures using six different benchmark datasets. The experimental results show that the Hybrid F-Firefly algorithm significantly improves the intra-cluster distance compared with existing algorithms such as K-means, FCM and FA.
Funding: Supporting project number (RSP2022R498), King Saud University, Riyadh, Saudi Arabia.
Abstract: Data clustering is crucial when it comes to data processing and analytics. The new clustering method overcomes the challenge of evaluating and extracting data from big data. Both numerical and categorical data can be grouped. Existing clustering methods favor numerical data clustering and ignore categorical data clustering. Until recently, the only way to cluster categorical data was to convert it to a numeric representation and then cluster it using existing numeric clustering methods. However, these algorithms could not use the concept of categorical data for clustering. Following that, suggestions for expanding traditional categorical data processing methods were made. In addition to these expansions, several new clustering methods and extensions have been proposed in recent years. ROCK is an adaptable and straightforward algorithm that calculates the similarity between data sets in order to cluster them. This paper aims to modify the algorithm by creating a parameterized version that takes specific algorithm parameters as input and outputs satisfactory cluster structures. The modified algorithm is named the parameterized ROCK algorithm (P-ROCK). The proposed modification makes the original algorithm more flexible by using user-defined parameters. A detailed hypothesis was developed and later validated with experimental results on real-world datasets using the proposed P-ROCK algorithm. A comparison with the original ROCK algorithm is also provided. Experimental results show that the proposed algorithm is on par with the original ROCK algorithm, with an accuracy of 97.9%. The proposed P-ROCK algorithm improves the runtime and is more flexible and scalable.
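The abstract above does not say which parameters P-ROCK exposes. The sketch below therefore illustrates only the core idea of the original ROCK algorithm as it is usually described. Two categorical records are neighbors when their Jaccard similarity reaches a threshold θ, and the link count of a pair is the number of neighbors they share. The function names are hypothetical, and the sketch assumes a record is not counted as its own neighbor, a convention that varies between descriptions.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Sequence, Set, Tuple


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity of two categorical records treated as item sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def rock_links(records: Sequence[Iterable[str]],
               theta: float) -> Dict[Tuple[int, int], int]:
    """Return {(i, j): number of common neighbors} for record pairs with at least one link."""
    n = len(records)
    neighbors: List[Set[int]] = [set() for _ in range(n)]
    # Two records are neighbors if their similarity is at least theta.
    for i, j in combinations(range(n), 2):
        if jaccard(records[i], records[j]) >= theta:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # link(i, j) = size of the intersection of the two neighbor sets.
    links: Dict[Tuple[int, int], int] = {}
    for i, j in combinations(range(n), 2):
        common = len(neighbors[i] & neighbors[j])
        if common:
            links[(i, j)] = common
    return links
```

ROCK then merges clusters agglomeratively, preferring pairs with many cross links. The threshold θ is the kind of user-supplied value that shapes the resulting cluster structure.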
Abstract: A number of clustering algorithms have been used to analyze many databases in the field of image clustering. The main objective of this research work was to perform a comparative analysis of two existing partition-based clustering algorithms and a hybrid clustering algorithm. The results were verified through the accuracy of classification algorithms. The performance of the clustering and classification algorithms was evaluated in this work based on tumor identification, cluster quality and other parameters such as run time and volume complexity. Some well-known classification algorithms were used to measure the accuracy of the results produced by the clustering algorithms. The performance of the clustering algorithms, particularly k-Means and FCM, proved meaningful in many domains. In addition, the proposed multifarious clustering technique showed its efficiency in predicting tumor-affected regions in mammogram images. The color images are converted into grayscale images and then processed. Finally, the best method for finding tumors in breast images is identified. This research would be immensely useful to physicians and radiologists in identifying cancer-affected areas in the breast.
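Abstract: At present, most feature selection algorithms are designed for complete datasets. When faced with datasets that have missing values and no labels, most feature selection algorithms are ineffective. To solve the feature selection problem for incomplete, unlabeled datasets, this paper proposes a ReliefF algorithm based on weighted FCM that incorporates mutual information and alternately updates the feature weights (WFCM-IReliefF, Improved ReliefF Based on WFCM). First, the FCM algorithm is applied for unsupervised learning on the dataset completed by mean pre-imputation, in order to find the nearest neighbors of each sample. Second, the feature weights computed by the ReliefF algorithm are substituted into the weighted FCM algorithm, which addresses the poor clustering caused by the difference between the original space and the feature space; the key features are obtained by alternately updating the weighted FCM algorithm and the ReliefF algorithm. Next, matrix factorization is applied to the dataset after feature selection to improve the pre-imputation of the missing data. Finally, comparative experiments on several public UCI datasets verify that the proposed algorithm achieves fairly satisfactory results compared with the other algorithms.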